Abstract

Background: Random biological sequences are a topic of great interest in genome 
analysis since, according to a powerful paradigm, they represent the background 
noise from which the actual biological information must differentiate. Accordingly, 
the generation of random sequences has been investigated for a long time. Similarly, 
random objects of a more complicated structure like RNA molecules or proteins are
of interest. 

Results: In this article, we present a new general framework for deriving algorithms 
for the non-uniform random generation of combinatorial objects according to the 
encoding and probability distribution implied by a stochastic context-free grammar. 
Briefly, the framework extends the well-known recursive method for (uniform)
random generation and uses the popular framework of admissible specifications of 
combinatorial classes, introducing weighted combinatorial classes to allow for the 
non-uniform generation by means of unranking. This framework is used to derive an 
algorithm for the generation of RNA secondary structures of a given fixed size. We 
address the random generation of these structures according to a realistic 
distribution obtained from real-life data by using a very detailed context-free 
grammar (that models the class of RNA secondary structures by distinguishing 
between all known motifs in RNA structure). Compared to well-known sampling 
approaches used in several structure prediction tools (such as SFold) ours has two 
major advantages: Firstly, after a preprocessing step in time $O(n^2)$ for the
computation of all weighted class sizes needed, with our approach a set of $m$
random secondary structures of a given structure size $n$ can be computed in
worst-case time complexity $O(m \cdot n \cdot \log(n))$ while other algorithms typically have a
runtime in $O(m \cdot n^2)$. Secondly, our approach works with integer arithmetic only
which is faster and saves us from all the discomforting details of using floating point 
arithmetic with logarithmized probabilities. 

Conclusion: A number of experimental results show that our random generation
method produces realistic output, at least with respect to the appearance of the
different structural motifs. The algorithm is available as a webservice at
http://wwwagak.cs.uni-kl.de/NonUniRandGen and can be used for generating random
secondary structures of any specified RNA type. A link to download an 
implementation of our method (in Wolfram Mathematica) can be found there, too. 

Keywords: Random generation, stochastic context-free grammars, RNA secondary 
structures 
Background and Introduction

The topic of random generation algorithms (also called samplers) has been widely 
studied by computer scientists. As stated in [1], it has been examined under different 
perspectives, including combinatorics, algorithmics (design and/or engineering), as well 
as probability theory, where two of the main motivations for random sampling are the 
testing of combinatorial properties of structures (e.g. conjectured structural properties, 
quantitative aspects), as well as the testing of properties of the corresponding algo- 
rithms (with respect to correctness and/or efficiency). 

As regards software engineering, the so-called random testing approach is commonly
used to test implementations of particular algorithms, as it is usually not feasible
to consider all possible inputs and it is unknown which of these inputs are among the
most interesting ones. In fact, this approach requires the generation of random
instances of program inputs that obey various sorts of syntactic and semantic
constraints (where the random instances usually ought to be of a preliminarily fixed input
size in order to be comparable to each other). 

In the Bioinformatics area, algorithms for generating random biological sequences 
have been investigated for a long time (see e.g. [2,3]). As stated in [4], random sequences 
are a topic of great interest in genome analysis, since according to a powerful paradigm, 
they represent the background noise from which the actual biological information must 
differentiate. Thus, random generation of combinatorial objects can be used in this
context for simulation studies in order to isolate signal (unexpected events) from noise
(statistically unavoidable regularities). In fact, according to [4], random biological 
sequences are for instance widely used for the detection of over-represented and under- 
represented motifs, as well as for determining whether scores of pairwise alignments are 
relevant or not: although there exist analytic approaches for these kinds of problems, for 
the most complex cases, it is often still necessary to be able to alternatively use a
corresponding experimental approach (based on randomly generated sequences obtained
from a computer program). For this purpose, random sequences must obviously obey
a certain model that takes into account some relevant properties of actual real-life
sequences, where such models are usually based on statistical parameters only. However, 
it is known that these classical models can be enriched by adding structural parameters 
(see [4]). Over the past years, several methods have been proposed for the random
generation of more complex structures, where special attention has been paid to RNA
secondary structures. RNA is a single-stranded nucleotide polymer and a major component
of cellular processes (like DNA and proteins). An RNA strand is formed by linking 
together certain nucleotide units. The specific sequence of nucleotides along this chain 
is called the primary structure of the molecule. By pairing of nucleotides that are not 
linked in this chain (i.e. by the so-called effects of base pairing), the linear primary struc- 
ture is folded into a three-dimensional conformation, called the tertiary structure, which 
in many cases determines the function of the molecule. Most of the 3D structure is 
determined by the intramolecular base-pairing interactions in the plane, which together 
form the secondary structure of the molecule. For this reason, pseudoknots (induced by 
crossing base pairs) are considered as tertiary interactions and are usually not permitted 
in the definition of secondary structure. As unknotted structures contain only nested 
base pairs and are thus essentially two-dimensional, they can be modeled as planar 
graphs. This rather descriptive and commonly used planar graph model for RNA
secondary structures was first formalized in [5]. An example is shown in Figure 1.

Most of the existing random generation algorithms for RNA secondary structures are 
used for predicting the structure of a given RNA sequence (see e.g. [6,7]), while others 
can be employed for instance for evaluating structure comparison software [8]. Note
that secondary structure prediction methods based on random sampling represent a 
non-deterministic counterpart to the up-to-date most successful and popular physics- 
based prediction methods that make use of the energy minimization paradigm and are 
realized by dynamic programming algorithms (see e.g. [9-12]). Random sampling also 
differs from the stochastic RNA structure prediction approach that is based on
context-free modeling of structural motifs and adding some statistical parameters observed
in real-life data by assigning probabilities to the corresponding motifs (see e.g. [13-15]). 
Nevertheless, it should be mentioned that statistical sampling methods like [6,7] used 
for RNA structure prediction are based on thermodynamics and thus inevitably inherit 
the problems and imprecisions related to energy minimizing methods, which are 
caused by the still incomplete commonly used free energy models for RNAs. In order 
to overcome these pitfalls, one could take the competing point of view and consider 
only typical structural information observed in a set of sample data as the basis for a 
new random generation method. If that information draws a realistic picture for all the 
different motifs of a molecule's folding, the corresponding sampling method is likely to 
produce realistic results. Accordingly, several authors made use of stochastic
context-free grammars and employed machine-learning techniques to train parameter values
from a set of known secondary structures. Such grammars have widely been used in a 
predictive mode (see, e.g., [14]) but there are also successful examples of applications 
where the random sampling of derivation trees has been the core of the method (see, 
e.g., [16-18] but also [19] for examples). In the present paper, we follow that line of
ideas and rely on the approach of the technical report [20] to develop a new algorithm 
for the (non-uniform) random generation of RNA secondary structures (without pseu- 
doknots) according to a distribution induced by a set of sample RNA data (note that 
the algorithm actually generates secondary structures for a preliminarily fixed size, not
for a given RNA sequence of this size, which means we take the combinatorial point 
of view and completely abstract from sequence). 

The main contribution of this manuscript is the derivation of a new and efficient 
algorithm for the random generation of RNA secondary structures according to an ela- 
borate and thus very realistic model. For this purpose we use and generalize the 
approach from [20]. Particularly, our random generation method is based on a sophis- 
ticated context-free grammar for unknotted structures which, in order to model the 
class of all considered RNA secondary structures as realistically as possible, distinguishes
between all known structural motifs that may occur in unknotted RNA secondary 
structure. This means that any structural feature is modeled by one or more specific 
grammar rules with corresponding probabilities observed from real-life data. Note that 
this grammar is actually a special variant of the comprehensive grammar used in [21] 
for deriving a realistic RNA structure model and for performing the first ever analytical 
analysis of the expected free energy of a random secondary structure (of a specified 
RNA type). Actually, that grammar has been designed as a mirror of the famous 
Turner energy model [22,23] which serves as the foundation for most of the existing 
physics-based RNA structure prediction methods: all structural motifs for which there 
are different thermodynamic rules and parameters are created by distinct production 
rules (with corresponding probabilities). 

According to [20], our sampling method involves a weighted unranking algorithm for
obtaining the final structures. Briefly, considering an arbitrary structure class of size
(cardinality) $c$, a corresponding unranking method uses a well-defined ordering of all
class elements (according to a particular numbering scheme, the so-called ranking
method) and for a given input number $r \in \{1, \ldots, c\}$ outputs the structure with rank $r$ in
the considered ordering. That way, the random sampling based on a stochastic grammar
(which builds heavily on the use of small floating point numbers) is translated into an
unranking algorithm using integer values only. Notably, a complete structure of size $n$ is
generated by recursively unranking the distinct structural components from the
corresponding subclasses (of substructures with sizes less than $n$). In our case, the weighted
unranking algorithm requires a precomputation step in worst-case time $O(n^2)$ for
computing all weighted class sizes up to input size $n$. The worst-case complexity for
generating a secondary structure of size $n$ at random is then given by $O(n \log n)$ since we are
ranking structures according to the boustrophedon order (see e.g. [7]).

By the end of this paper, we analyze the quality of randomly generated structures by 
considering some experimental results. First, we will consider statistical indicators of 
many important parameters related to particular structural motifs and compare the 
ones observed in the used sample set of real world RNA data to those observed in a 
corresponding set of random structures. Their comparison measures indicate that our 
method actually generates realistic RNA structures. Obviously, an algorithm which, for
a given structure size n, produces random RNA secondary structures that are (with respect
to the expected shapes of such structures) in most cases realistic is a major improvement
over existing approaches which, for example, are only capable of generating secondary
structures uniformly for size n. Furthermore, we will consider the two different free 
energy models defined in [21] for RNA secondary structure (with unknown RNA 
sequence) to get further evidence of the good quality of our random generation 
method (with respect to free energies and thus rather likely also with respect to 
appearance of the different structural motifs of RNA). 

Prior Results and Basic Definitions 

Uniform Random Generation 

In the past, the problem of uniform random generation of combinatorial structures, 
that is the problem of randomly generating objects (of a preliminarily fixed input size)
of a specified class that have the same or similar properties, has been extensively
studied. Special attention has been paid to the wide class of decomposable structures
which are basically defined as combinatorial structures that can be constructed recur- 
sively in an unambiguous way. 

In principle, two general (systematic) approaches have been developed for the uniform 
generation of these structures: First, the recursive method originated in [24] (to generate 
various data structures) and later systematized and extended in [25] (to decomposable 
data structures), where general combinatorial decompositions are used to generate 
objects at random based on counting possibilities. Second and more recently, the so- 
called Boltzmann method [1,26], where random objects (under the corresponding Boltz- 
mann model) have a fluctuating size, but objects with the same size invariably occur 
with the same probability. Note that according to [26], Boltzmann samplers may be 
employed for approximate-size (objects with a randomly varying size are drawn) as well 
as fixed-size (objects of a strictly fixed size are drawn) random generation and are an 
alternative to standard combinatorial generators based on the recursive method. How- 
ever, fixed-size generation is considered the standard paradigm for the random genera- 
tion of combinatorial structures. 

(Admissible) Constructions and Specifications 

According to [25], a decomposable structure is a structure that admits an equivalent
combinatorial specification:

Definition 0.1 ([25]). Let $\mathcal{A} = (\mathcal{A}_1, \ldots, \mathcal{A}_r)$ be an r-tuple of classes of combinatorial
structures. A specification for $\mathcal{A}$ is a collection of $r$ equations with the $i$th equation
being of the form $\mathcal{A}_i = \phi_i(\mathcal{A}_1, \ldots, \mathcal{A}_r)$, where $\phi_i$ denotes a term built of the $\mathcal{A}_j$ using
the constructions of disjoint union, cartesian product, sequence, set and cycle, as well
as the initial (neutral and atomic) classes.

The needed formalities that will also be used in the sequel are given as follows:
Definition 0.2 ([27]). If $\mathcal{A}$ is a combinatorial class, then $\mathcal{A}_n$ denotes the class of
objects in $\mathcal{A}$ that have size (defined as number of atoms) $n$. Furthermore:

♦ Objects of size 0 are called neutral objects or tags and a class consisting of a
single neutral object $\epsilon$ is called a neutral class, which will be denoted by $\mathcal{E}$ ($\mathcal{E}_1, \mathcal{E}_2, \ldots$ to
distinguish multiple neutral classes containing the objects $\epsilon_1, \epsilon_2, \ldots$, respectively).

♦ Objects of size 1 are called atomic objects or atoms and a class consisting of a
single atomic object is called an atomic class, which will be denoted by $\mathcal{Z}$ ($\mathcal{Z}_a, \mathcal{Z}_b, \ldots$
to distinguish the classes containing the atoms $a, b, \ldots$, respectively).

♦ If $\mathcal{A}_1, \ldots, \mathcal{A}_k$ are combinatorial classes and $\epsilon_1, \ldots, \epsilon_k$ are neutral objects, the
combinatorial sum or disjoint union is defined as
$\mathcal{A}_1 + \ldots + \mathcal{A}_k := (\mathcal{E}_1 \times \mathcal{A}_1) \cup \ldots \cup (\mathcal{E}_k \times \mathcal{A}_k)$, where $\cup$ denotes set theoretic union.

♦ If $\mathcal{A}$ and $\mathcal{B}$ are combinatorial classes, the cartesian product is defined as
$\mathcal{A} \times \mathcal{B} := \{(\alpha, \beta) \mid \alpha \in \mathcal{A} \text{ and } \beta \in \mathcal{B}\}$, where $\mathrm{size}((\alpha, \beta)) = \mathrm{size}(\alpha) + \mathrm{size}(\beta)$.

Note that the constructions of disjoint union, cartesian product, sequence, set and 
cycle are all admissible: 

Definition 0.3 ([27]). Let $\Phi$ be an m-ary construction that associates to any collection
of classes $\mathcal{B}_1, \ldots, \mathcal{B}_m$ a new class $\mathcal{A} := \Phi[\mathcal{B}_1, \ldots, \mathcal{B}_m]$. The construction $\Phi$ is admissible iff
the counting sequence $(a_n)$ of $\mathcal{A}$ only depends on the counting sequences $(b_{1,n}), \ldots, (b_{m,n})$
of $\mathcal{B}_1, \ldots, \mathcal{B}_m$, where the counting sequence of a combinatorial class $\mathcal{A}$ is the sequence of
integers $(a_n)_{n \geq 0}$ for $a_n = \mathrm{card}(\mathcal{A}_n)$.

The framework of (admissible) specifications obviously resembles that of context-free 
grammars (CFGs) known from formal language theory (note that we assume the reader 
has basic knowledge of the notions concerning context-free languages and grammars. 
An introduction can be found for instance in [28]). In order to translate a CFG into the 
framework of admissible constructions, it is sufficient to make each terminal symbol an 
atom and to assume each non-terminal A to represent a class A (the set of all words 
which can be derived from non-terminal A). However, for representing CFGs, only the 
admissible constructions disjoint union, cartesian product and sequence are needed: 
Words are constructed as cartesian products of atoms, sentential forms as cartesian pro- 
ducts of atoms and the classes assigned to the corresponding non-terminal symbols. For 
instance, a production rule $A \to aB$ translates into the symbolic equation $\mathcal{A} = a \times \mathcal{B}$.
Different production rules with the same left-hand side give rise to the union of the cor- 
responding cartesian products. Nevertheless, it should be noted that [25] also shows 
how to reduce specifications to standard form, where the corresponding standard speci- 
fications constitute the basis of the recursive method for uniform random generation 
and extend the usual Chomsky normal form (CNF) for CFGs. Briefly, in standard
specifications, all sums and products are binary and the constructions of sequences, sets and
cycles are actually replaced with other constructions (for details see [25]). 

The prime advantage of standard specifications is that they translate directly into 
procedures for computing the sizes of all combinatorial subclasses of the considered 
class C of combinatorial objects. This means they can be used to count the number of 
structures of a given size that are generated from a given non-terminal symbol. More- 
over, standard specifications immediately translate into procedures for generating one 
such structure uniformly at random. The corresponding procedures (for class size cal- 
culations and structure generations) are actually required for (uniform) random gen- 
eration of words of a given CFG by means of unranking. 

Simply speaking, the unranking of decomposable structures (like for instance RNA
secondary structures which can be uniquely decomposed into distinct structural
components) works as follows: Each structure $s$ in the combinatorial class $\mathcal{S}_n$ of all feasible
structures having size $n$ is given a number (rank) $i \in \{0, \ldots, \mathrm{card}(\mathcal{S}_n) - 1\}$, defined by a
particular ranking method. Based on this ordering of the considered structure class $\mathcal{S}_n$,
the corresponding unranking algorithm for a given input number
$i \in \{0, \ldots, \mathrm{card}(\mathcal{S}_n) - 1\}$ computes the single structure $s \in \mathcal{S}_n$ having number $i$ in the
ranking scheme defined for class $\mathcal{S}_n$.

Note that in this context of unranking particular elements from a considered struc- 
ture class, the corresponding algorithms make heavy use of their decomposability, as 
the distinct structural components are unranked from the corresponding subclasses. In 
fact, the class sizes can be derived according to the following recursion: 



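$$
\mathrm{size}(\mathcal{C}, n) :=
\begin{cases}
1 & \mathcal{C} \text{ is neutral and } n = 0, \\
0 & \mathcal{C} \text{ is neutral and } n \neq 0, \\
1 & \mathcal{C} \text{ is atomic and } n = 1, \\
0 & \mathcal{C} \text{ is atomic and } n \neq 1, \\
\sum_{i=1}^{k} \mathrm{size}(\mathcal{A}_i, n) & \mathcal{C} = \mathcal{A}_1 + \ldots + \mathcal{A}_k, \\
\sum_{j=0}^{n} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{B}, n - j) & \mathcal{C} = \mathcal{A} \times \mathcal{B}.
\end{cases}
$$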
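As an illustration of this recursion, the following minimal Python sketch computes class sizes for a toy specification. The specification (bar-bracket words generated by S → ε | '|' S | '(' S ')' S, written with binary products) and all names such as `SPEC` and `size` are assumptions made for this example; they are not the grammar or the implementation used in this article.

```python
from functools import lru_cache

# Hypothetical toy specification (an assumption, not the article's grammar):
#   S = EPS + BAR x S + OPEN x (S x (CLOSE x S))
SPEC = {
    "EPS": ("neutral",),
    "BAR": ("atom", "|"),
    "OPEN": ("atom", "("),
    "CLOSE": ("atom", ")"),
    "S": ("sum", "EPS", "P1", "P2"),
    "P1": ("prod", "BAR", "S"),      # '|' S
    "P2": ("prod", "OPEN", "P3"),    # '(' S ')' S
    "P3": ("prod", "S", "P4"),
    "P4": ("prod", "CLOSE", "S"),
}


@lru_cache(maxsize=None)
def size(name, n):
    """Number of objects of size n in the class called `name`."""
    kind, *args = SPEC[name]
    if kind == "neutral":
        return 1 if n == 0 else 0
    if kind == "atom":
        return 1 if n == 1 else 0
    if kind == "sum":
        return sum(size(a, n) for a in args)
    # Cartesian product: sum over all splits j + (n - j) of the size n.
    a, b = args
    total = 0
    for j in range(n + 1):
        size_a = size(a, j)
        if size_a:  # skip empty factors; this also avoids circular calls
            total += size_a * size(b, n - j)
    return total
```

Memoizing `size` plays the role of the precomputed class size tables; with a fixed number of classes, filling the tables up to size n costs a quadratic number of coefficient operations because every product entry is a convolution sum.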



Note that when computing the sums for cartesian products, we can either consider
the values for $j$ in the sequential (also called lexicographic) order $(1, 2, 3, \ldots, n)$ or in the
so-called boustrophedon order $(1, n, 2, n-1, \ldots, \lceil n/2 \rceil)$. In either case, given a fixed number
of considered combinatorial (sub)classes (or corresponding non-terminal symbols), the
precomputation of all class size tables up to size $n$ requires $O(n^2)$ operations on
coefficients. One random generation step then needs $O(n^2)$ arithmetic operations when
using the sequential method and $O(n \cdot \log(n))$ operations when using the
boustrophedon method (for details we refer to [25]). Obviously, using uniform unranking
procedures to construct the $i$th structure of size $n$ for a randomly drawn number $i$, any
structure of size $n$ is equiprobably generated. Consequently, in order to make sure
that, for given size $n$ and a sample set of random numbers $i$, the corresponding
structures are in accordance with an appropriate probability distribution (as for instance
observed from real-life RNA data), it is mandatory to use a corresponding non-uniform
unranking method or an alternative non-uniform random generation approach.
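The following Python sketch illustrates uniform unranking with either order of split points for cartesian products. It is a minimal sketch under stated assumptions: the toy specification (S → ε | '|' S | '(' S ')' S), the names `SPEC`, `size`, `split_order` and `unrank`, and the convention that split points range over 0, ..., n are all hypothetical choices for this example and are not taken from [25] or from the implementation described in this article.

```python
from functools import lru_cache

# Hypothetical toy specification (assumption): S -> eps | '|' S | '(' S ')' S
SPEC = {
    "EPS": ("neutral",), "BAR": ("atom", "|"),
    "OPEN": ("atom", "("), "CLOSE": ("atom", ")"),
    "S": ("sum", "EPS", "P1", "P2"),
    "P1": ("prod", "BAR", "S"),
    "P2": ("prod", "OPEN", "P3"),
    "P3": ("prod", "S", "P4"),
    "P4": ("prod", "CLOSE", "S"),
}


@lru_cache(maxsize=None)
def size(name, n):
    """Class size recursion (neutral, atom, sum, binary product)."""
    kind, *args = SPEC[name]
    if kind == "neutral":
        return int(n == 0)
    if kind == "atom":
        return int(n == 1)
    if kind == "sum":
        return sum(size(a, n) for a in args)
    a, b = args
    total = 0
    for j in range(n + 1):
        size_a = size(a, j)
        if size_a:
            total += size_a * size(b, n - j)
    return total


def split_order(n, boustrophedon=True):
    """Split points 0..n, sequentially or alternating from both ends."""
    if not boustrophedon:
        return list(range(n + 1))
    order, lo, hi = [], 0, n
    while lo <= hi:
        order.append(lo)
        if hi != lo:
            order.append(hi)
        lo, hi = lo + 1, hi - 1
    return order


def unrank(name, n, r, boustrophedon=True):
    """Word of size n with rank r, where 0 <= r < size(name, n)."""
    kind, *args = SPEC[name]
    if kind == "neutral":
        return ""
    if kind == "atom":
        return args[0]
    if kind == "sum":
        for a in args:  # find the summand whose block of ranks contains r
            s = size(a, n)
            if r < s:
                return unrank(a, n, r, boustrophedon)
            r -= s
        raise ValueError("rank out of range")
    a, b = args
    for j in split_order(n, boustrophedon):
        size_a = size(a, j)
        block = size_a * size(b, n - j) if size_a else 0
        if r < block:  # decode r into one rank per factor
            return (unrank(a, j, r % size_a, boustrophedon)
                    + unrank(b, n - j, r // size_a, boustrophedon))
        r -= block
    raise ValueError("rank out of range")
```

Drawing `r` uniformly from `range(size("S", n))` and calling `unrank("S", n, r)` then yields each word of size n with the same probability. The two orders assign different ranks to the same words, so they enumerate the same class in a different sequence.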

Non-Uniform Random Generation 

Coming back to the random testing problem from software engineering, we observe 
that generating objects of a given class of input data according to a uniform distribu- 
tion is sufficient for testing the correctness of particular algorithms. However, if one 
intends to gather information about the "real-life behaviour" of the algorithm (e.g. with 
respect to runtime or space requirements), one needs to perform simulations with input
data that are as closely related as possible to the corresponding application. This means that, to
obtain suitable test data, we need to specify a distribution on the considered class that
is similar to the one observed in real life and draw objects at random according to this 
(non-uniform) distribution. Deriving such a "realistic" distribution on a given class of 
objects can easily be done by modeling the class by an appropriate stochastic context- 
free grammar (SCFG). Details will follow in the next section. 

As regards RNA, it has been shown that both the combinatorial model (that is based
on a uniform distribution such that all structures of a given size are equiprobable and
that completely abstracts from the primary structure, see e.g. [29-31]) and the
Bernoulli-model (which is capable of incorporating information on the possible RNA
sequences for a given secondary structure, see e.g. [32-34]) for RNA secondary
structures are rather unrealistic. However, modeling these structures by an appropriate
SCFG yields a more realistic RNA model, where the probability distribution on all 
structures is determined from a database of real world RNA data (see e.g. [35,36]). 

Based on this observation, the problem of non-uniform random generation of combi- 
natorial structures has been recently addressed in [20]. There, it is described how to 
get algorithms for the random generation of objects of a previously fixed size according 
to an arbitrary (non-uniform) distribution implied by a given SCFG. In principle, the 
construction scheme introduced in [20] extends the recursive method for (uniform)
random generation [25] and adapts it to the problem of unranking of [37]: the
basic principle is that any (complex) combinatorial class can be decomposed into (or 
can be constructed from) simpler classes by using admissible constructions. 

Essentially, in [20], a new admissible construction called weighting has been intro- 
duced in order to make non-uniform random generation possible. By weighting, we 
understand the generation of distinguishable copies of objects. Formally: 

Definition 0.4. If $\mathcal{A}$ is a combinatorial class and $\lambda$ is an integer, the weighting of $\mathcal{A}$ by
$\lambda$ is defined as $\lambda\mathcal{A} := \underbrace{\mathcal{A} + \ldots + \mathcal{A}}_{\lambda \text{ times}}$. We will call two objects from a combinatorial class
copies of the same object iff they only differ in the tags added by weighting operations.

For example, if we weight the class $\mathcal{A} = \{a\}$ by two, we assume the result to be the
set $\{a, a\}$; weighting $\mathcal{B} = \{b\}$ by three generates $\{b, b, b\}$. Thus, $2\mathcal{A} + 3\mathcal{B} = \{a, a, b, b, b\}$
and within this class, $a$ has relative frequency $\frac{2}{5}$, while $b$ has relative frequency $\frac{3}{5}$.
Hence, this way it becomes possible to regard non-uniformly distributed classes.

As weighting a class can be replaced by a disjoint union, $\mathrm{size}(\lambda\mathcal{A}, n) = \lambda \cdot \mathrm{size}(\mathcal{A}, n)$
and the complexity results from [37] also hold for weighted classes. Hence, the
corresponding class size computations up to $n$ need $O(n^2)$ time.
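The effect of weighting can be checked with a few lines of Python. This is only a sketch of the idea: the helper `weight` and the representation of copies as (tag, object) pairs are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction


def weight(objects, lam):
    """Weighting by lam: lam distinguishable (tagged) copies of each object."""
    return [(tag, obj) for tag in range(lam) for obj in objects]


A, B = ["a"], ["b"]
combined = weight(A, 2) + weight(B, 3)  # the class 2A + 3B

# Relative frequency of each underlying object, ignoring the tags.
counts = Counter(obj for _tag, obj in combined)
freq = {obj: Fraction(c, len(combined)) for obj, c in counts.items()}
```

Here `freq` maps `a` to 2/5 and `b` to 3/5, and `len(weight(A, lam))` equals `lam * len(A)`, in line with the size identity for weighted classes. Drawing a tagged copy uniformly therefore draws the underlying object non-uniformly, using integers only.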

Stochastic Context-Free Grammars 

As already mentioned, stochastic context-free grammars (SCFGs) are a powerful tool for 
modeling combinatorial classes and the essence of the non-uniform random sampling 
approach that will be worked out in this article. Therefore, we will now give the 
needed background information. 
Basic Concepts 

Briefly, SCFGs are an extension of traditional CFGs: usual CFGs are only capable of 
modeling the class of all generated structures and thus inevitably induce a uniform dis- 
tribution on the objects, while SCFGs additionally produce a (non-uniform) probability 
distribution on the considered class of objects. In fact, an SCFG is derived by equip- 
ping the productions of a corresponding CFG with probabilities such that the induced 
distribution on the generated language models as closely as possible the distribution of 
the sample data. 
The needed formalities are given as follows: 

Definition 0.5 ([38]). A weighted context-free grammar (WCFG) is a 5-tuple
$\mathcal{G} = (I, T, R, S, W)$, where $I$ (resp. $T$) is an alphabet (finite set) of intermediate (resp.
terminal) symbols ($I$ and $T$ are disjoint), $S \in I$ is a distinguished intermediate symbol
called axiom, $R \subseteq I \times (I \cup T)^*$ is a finite set of production rules and $W : R \to \mathbb{R}^+$ is a
mapping such that each rule $f \in R$ is equipped with a weight $w_f := W(f)$. If $\mathcal{G}$ is a
WCFG, then $\mathcal{G}$ is a stochastic context-free grammar (SCFG) iff the following additional
restrictions hold:

1. For all $f \in R$, we have $W(f) \in (0,1]$, which means the weights are probabilities.

2. The probabilities are chosen in such a way that for all $A \in I$, we have
$\sum_{f \in R,\, Q(f) = A} w_f = 1$, where $Q(f)$ denotes the premise of the production $f$, i.e. the first
component $A$ of a production rule $(A, \alpha) \in R$. In the sequel, we will write $w_f : A \to \alpha$
instead of $f = (A, \alpha) \in R$, $w_f = W(f)$.

However, at this point, we refrain from recalling further basic concepts regarding SCFGs,
as they are not really necessary for the understanding of this article. The interested 
reader is referred to the corresponding section in [21]. For a more fundamental intro- 
duction on stochastic context-free languages, see for example [39]. In fact, the only 
information needed in the sequel is that if structures are modeled by a consistent 
SCFG, then the probability distribution on the production rules of the SCFG implies a 
probability distribution on the words of the generated language and thus on the
modeled structures. To ensure that an SCFG is consistent, one can for example assign
relative frequencies to the productions, which are computed by counting the produc- 
tion rules used in the leftmost derivations of a finite sample of words from the gener- 
ated language. For unambiguous SCFGs, the relative frequencies can actually be 
counted efficiently, as for every word, there is only one leftmost derivation to consider. 
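The counting of relative frequencies can be sketched as follows for a small unambiguous grammar. The toy grammar S → ε | '|' S | '(' S ')' S, the rule labels and the function names `count_rules` and `train` are assumptions made for this example; they are not the grammar or the training procedure used in this article.

```python
from collections import Counter
from fractions import Fraction


def count_rules(word):
    """Count the rules used in the unique leftmost derivation of `word`
    for the toy grammar S -> eps | '|' S | '(' S ')' S (an assumption)."""
    counts = Counter()
    pos = 0

    def parse_s():
        nonlocal pos
        if pos < len(word) and word[pos] == "|":
            counts["S -> |S"] += 1
            pos += 1
            parse_s()
        elif pos < len(word) and word[pos] == "(":
            counts["S -> (S)S"] += 1
            pos += 1
            parse_s()
            if pos >= len(word) or word[pos] != ")":
                raise ValueError("unbalanced brackets")
            pos += 1
            parse_s()
        else:
            counts["S -> eps"] += 1

    parse_s()
    if pos != len(word):
        raise ValueError("not a word of the toy language")
    return counts


def train(sample):
    """Relative rule frequencies over a finite sample of words.
    All rules share the premise S, so one normalization suffices;
    in general, frequencies are normalized per left-hand side."""
    total = Counter()
    for word in sample:
        total += count_rules(word)
    n = sum(total.values())
    return {rule: Fraction(c, n) for rule, c in total.items()}
```

For instance, `train(["(|)", "||"])` counts one use of S → (S)S, three uses of S → |S and three uses of S → ε, so the resulting probabilities are 1/7, 3/7 and 3/7 and sum to one.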
Modeling RNA Secondary Structure via SCFGs 

Besides the popular planar graph representation of unknotted secondary structures, 
many other ways of formalizing RNA folding have been described in the literature. One
well-established example is the so-called bar-bracket representation, where a secondary
structure is modeled as a string over the alphabet $\Sigma := \{(, ), |\}$, with a bar | and a pair
of corresponding brackets ( ) representing an unpaired nucleotide and two paired 
bases in the molecule, respectively (see, e.g. [30]). Obviously, both models abstract 
from primary structure, as they only consider the number of base pairs and unpaired 
bases and their positions. Moreover, there exists a one-to-one correspondence between 
both representations, as illustrated by the following example: 

Example 0.1. The secondary structure shown in Figure 1 has the following equivalent
bar-bracket representation that can be decomposed into subwords corresponding to the
basic structural motifs that are distinguished in state-of-the-art thermodynamic models:

||||(((( ||| hel_1 ||| hel_2 || hel_3 | )))) ||

Here, the leading and trailing bars belong to the exterior loop and the four outer base
pairs close a multiloop (of degree 3) that contains the subwords hel_1, hel_2 and hel_3.
These subwords decompose further into hairpin loops, left bulges and interior loops (of
sizes 1x1, 1x7, 2x2 and 2x7); in addition, hel_2 contains a multiloop (of degree 2) with
the nested subwords hel_{2,1} and hel_{2,2}.

Note that the reading order of secondary structures is from left to right, which is due
to the chemical structure of the molecule. 
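The one-to-one correspondence between a bar-bracket word and the base pairs of the planar graph model can be made concrete with a short Python sketch. The function name `base_pairs` and the 0-based position convention are assumptions made for this example.

```python
def base_pairs(structure):
    """Return the sorted list of base pairs (i, j), 0-based with i < j,
    encoded by a bar-bracket word over the alphabet {'(', ')', '|'}."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)  # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError("closing bracket without partner")
            pairs.append((stack.pop(), pos))
        elif symbol != "|":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("opening bracket without partner")
    return sorted(pairs)
```

Because brackets are matched with a stack, the resulting pairs are always nested, which is exactly the absence of pseudoknots. Positions that occur in no pair correspond to the bars, that is, to unpaired nucleotides.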

Consequently, secondary structures without pseudoknots can be encoded as words of 
a context-free language and the class of all feasible structures can thus effectively be 
modeled via a corresponding CFG. Basically, that CFG can be constructed to describe 
a number of classical constraints (e.g. the presence of particular motifs in structures) 
and it can also express long-range interactions (e.g. base pairings). By extending it to a 
corresponding SCFG, we can also model the fact that specific motifs of RNA secondary 
structures are more likely to be folded at certain stages than others (and not all possi- 
ble motifs are equiprobable at any folding stage). 

In fact, it has been known for a long time that SCFGs can be used to model RNA secondary
structures (see e.g. [40]). Additionally, SCFGs have already been used successfully for
the prediction of RNA secondary structure [14,15]. Moreover, they can be employed
for identifying structural motifs as well as for deriving stochastic RNA models that
are (with respect to the expected shapes) more realistic than other models [36].
Furthermore, note that an SCFG mirror of the famous Turner energy model has been
used in [21] to perform the first analytical analysis of the free energy of RNA
secondary structures; this SCFG marks a cornerstone between stochastic and physics-based
approaches towards RNA structure prediction.
Random Generation With SCFGs 

SCFGs can easily be used for the random generation of combinatorial objects accord- 
ing to the probability distribution induced by a sample set, where the only problem is 
that they do not allow the user to fix the length of generated structures. In particular, 
given an SCFG $\mathcal{G}$ and the corresponding language (combinatorial class) $\mathcal{L}(\mathcal{G})$, a
random word $w \in \mathcal{L}(\mathcal{G})$ can be generated in the following way:

♦ Start with the sentential form $S$ (where $S$ denotes the axiom of the grammar $\mathcal{G}$).

♦ While there are non-terminal symbols (in the currently considered sentential
form), do the following:

1) Let $A$ denote the leftmost non-terminal symbol.

2) Draw a random number $r$ from the interval $(0,1]$.

3) Substitute symbol $A$ by the right-hand side $\alpha$ of the production $A \to \alpha$
determined by the random number $r$.

This means consider all $m \geq 1$ rules $p_1 : A \to \alpha_1, \ldots, p_m : A \to \alpha_m$ having left-hand
side $A$, where $\sum_{i=1}^{m} p_i = 1$ must hold. Then, find
$k \geq 1$ with $\sum_{i=1}^{k-1} p_i < r \leq \sum_{i=1}^{k} p_i$, i.e. determine $k \geq 1$ with $r \in \left(\sum_{i=1}^{k-1} p_i, \sum_{i=1}^{k} p_i\right]$.
The production corresponding to the randomly drawn number $r \in (0,1]$ is then given
by $A \to \alpha_k$ and hence, in the currently considered sentential form, the non-terminal
symbol $A$ is substituted by $\alpha_k$.

♦ If there are no more non-terminal symbols, then the currently considered
sentential form is equal to a word $w \in \mathcal{L}(\mathcal{G})$; $w$ has been randomly generated.
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Note that the choice of the production made in 3) according to the previously drawn 
random number is appropriate, since it is conform to the probability distribution on 
the grammar rules. 
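
The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy grammar with the two rules S → ε and S → (S), the probabilities 1/3 and 2/3, and the names `RULES`, `choose_production` and `generate` are all hypothetical and not taken from the paper.

```python
import random
from fractions import Fraction

# Hypothetical toy SCFG: S -> epsilon (prob. 1/3) and S -> (S) (prob. 2/3).
# The probabilities are assumed for illustration only.
RULES = {
    "S": [(Fraction(1, 3), []), (Fraction(2, 3), ["(", "S", ")"])],
}

def choose_production(rules, r):
    """Return alpha_k for the k with sum_{i<k} p_i < r <= sum_{i<=k} p_i."""
    cumulative = Fraction(0)
    for p, rhs in rules:
        cumulative += p
        if r <= cumulative:
            return rhs
    return rules[-1][1]  # only reached if the probabilities do not sum to 1

def generate(rules, axiom="S", rng=random):
    """Rewrite the leftmost non-terminal until only terminals remain."""
    form = [axiom]
    while True:
        idx = next((k for k, sym in enumerate(form) if sym in rules), None)
        if idx is None:
            return "".join(form)
        r = 1.0 - rng.random()  # random() lies in [0, 1), so r lies in (0, 1]
        form[idx:idx + 1] = choose_production(rules[form[idx]], r)
```

Note that the length of the generated word is itself random here, which is exactly the limitation discussed next.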

Example 0.2. Consider the language generated by the SCFG with productions $p_1 : S \to \varepsilon$
and $p_2 : S \to (S)$, where $p_1 + p_2 = 1$. We start with the sentential form S, then consider
the leftmost non-terminal symbol, which is S, and draw a random number $r \in (0,1]$. If
$0 < r \leq p_1$, the production determined by r is $S \to \varepsilon$; thus we get the empty word
and are finished. Otherwise, $p_1 < r \leq p_1 + p_2 = 1$, which means we have to use $S \to (S)$
for the substitution in step 3) and thus obtain the sentential form (S). Afterwards, we must
repeat the process, as there is still one non-terminal symbol left.

Unfortunately, there is one major problem with this approach to the (non-uniform)
random generation of combinatorial objects: the underlying (consistent) SCFG $\mathcal{G}$
implies a probability distribution on the whole language $\mathcal{L}(\mathcal{G})$, so that
we generate a word of arbitrary size. In order to fix the size, we can proceed along the
following lines:

1) We translate the grammar $\mathcal{G}$ into a new framework which allows fixed sizes
to be considered for the random generation, such that

2) the distribution implied on $\mathcal{L}(\mathcal{G})$ conditioned on any fixed size n is
kept within the new framework.

A well-known approach which allows for 1) is connected to the concept of admissible
constructions used to describe a decomposable combinatorial class (see above). As
the operations (like cartesian products, unions, and so on) used to construct the
combinatorial objects are also used to define an order on them, it becomes possible to
identify the i-th object of a given size, and the problem of generating objects uniformly
at random reduces to the problem of unranking, that is, the problem of constructing
the object of order (rank) i, for i a random number (see e.g. [41]).

Remark. Some might think that with an appropriate SCFG (modeling a given class of
objects) at hand, it is not really necessary to use an unranking method that implies
cumbersome formalities such as admissible constructions and decomposable classes if
we want to generate random objects of a fixed size n. As a matter of principle, they
are right: we could also use a conditional sampling method. If we need to generate a
word of size n from non-terminal symbol A, where there are $m \geq 1$ rules
$f_i = A \to \alpha_i$, $1 \leq i \leq m$, having left-hand side A, then we just need to
choose the next production $f_i$ according to

$$\frac{\mathrm{Prob}(A \to \alpha_i \Rightarrow^* x \mid \mathrm{size}(x) = n)}{\mathrm{Prob}(A \Rightarrow^* x \mid \mathrm{size}(x) = n)},$$

which is the posterior probability that we used production rule $f_i$ under the condition
that a word of size n is generated.

Similarly, if the production rule is of the type $A \to BC$ (assuming the grammar is in
Chomsky normal form (CNF), which does not pose a problem, as an unambiguous
SCFG can be efficiently transformed into CNF [39]), we can choose a way to split size
n into sizes j and n - j for the lengths generated from non-terminal symbols B and C.
This requires precomputing n length-dependent probabilities (i.e. all probabilities for
generating a word of any length up to n) for each non-terminal symbol, which might
seem to be similar (with respect to complexity) to precomputing all class sizes up to n
for all considered combinatorial (sub)classes, as needs to be done for unranking.
However, there is a striking difference between the two approaches: while conditional
sampling makes heavy use of rather small floating point values, with all the well-known
problems and discomforting details such as underflows or the use of logarithms
associated with them, our unranking approach builds on integer values only, which we
consider a major advantage. There is another striking difference: length-dependent
probabilities (which, by the way, yield a so-called length-dependent SCFG (LSCFG),
see [42], and have already been used in [43]) require a very rich training set. In fact,
if the RNA data set used for determining the distribution induced by the grammar is not
rich enough, then the corresponding stochastic RNA model is underestimated and its
quality decreases. This is especially a problem when considering comprehensive CFGs
that distinguish between many different structural motifs in order to get a realistic
picture of the molecules' behaviour; such a grammar should, however, be preferred over
simple lightweight grammars as the basis for a non-uniform random generation method.
Nevertheless, this problem does not surface when sticking to conventional probabilities
and the corresponding traditional SCFG model. Actually, since we consider a huge CFG
where all possible structural motifs are created by distinct productions, we generally
obtain realistic probability distributions and RNA models (see [21]).

Finally, note that we could of course use random sampling strategies originally
designed to sample structures for a given sequence in order to generate a random
secondary structure only. However, such algorithms typically need linear time to
sample a single base pair (see, e.g., [6]), so that the time to sample a complete
structure is quadratic in its length. This causes no problems for the original
application of such algorithms, since the sequence-dependent preprocessing which is
part of their overall procedure is at least quadratic in time and thus the dominating
part. Here our approach is advantageous (replacing a factor n by log(n)), and since our
preprocessing only depends on the size of the structure to be generated, it can be
performed once and stored to disk for later reuse. Last but not least, we are not sure
whether the different existing approaches just mentioned could easily be made as fast
as ours by simple changes only.

The bottom line is that hooking up to unranking of combinatorial classes offers three
significant benefits compared to conditional sampling, namely a fast sampling strategy,
the use of integers instead of floating point values, and a greater independence of the
richness of the training data (compared to length-dependent models). For this reason,
we consider our unranking algorithm a valuable contribution, even though it requires a
more cumbersome framework.

Unranking of Combinatorial Objects 

The problem of unranking can easily be solved along the composition of the objects at
hand, i.e. the operations used for their construction, once we know the number of
possible choices for each substructure. Assume for example that we want to unrank objects
from a class $\mathcal{C} = \mathcal{A} + \mathcal{B}$. We will assume all elements of
$\mathcal{A}$ to be of smaller order than those of $\mathcal{B}$ (this way we use the
construction of the class to imply an ordering). Finding the i-th element of $\mathcal{C}$,
i.e. unranking class $\mathcal{C}$, now becomes possible by deciding whether
$i < \mathrm{card}(\mathcal{A})$. In this case, we recursively call the unranking procedure
for $\mathcal{A}$. Otherwise (i.e. if $i \geq \mathrm{card}(\mathcal{A})$), we consider
$\mathcal{B}$, searching for its $(i - \mathrm{card}(\mathcal{A}))$-th element.
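
As a minimal sketch of this case distinction (assuming ranks start at 0, as in the rank range used later for the unranking algorithm), a disjoint union of k classes can be unranked as follows. The function name `unrank_union` and the representation of a class as a pair (cardinality, unranking function) are illustrative choices, not the interface of [20].

```python
def unrank_union(i, classes):
    """Unrank the i-th object (0-based) of A_1 + ... + A_k.

    `classes` lists (cardinality, unrank_fn) pairs in the order A_1, ..., A_k,
    so all objects of A_1 precede those of A_2, and so on.
    """
    for card, unrank in classes:
        if i < card:
            return unrank(i)   # the object lies in this summand
        i -= card              # skip this summand and shift the rank
    raise IndexError("rank out of range")

# Toy classes given as explicit lists (hypothetical data).
A = ["a0", "a1", "a2"]
B = ["b0", "b1"]
C = [(len(A), A.__getitem__), (len(B), B.__getitem__)]
```

With these toy classes, ranks 0 to 2 select elements of A and ranks 3 and 4 select elements of B.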

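Formally, we first need to specify an order on all objects of the considered
combinatorial class that have the same size. This can be done in a recursive way according
to the admissible specification of the class:

Definition 0.6 ([37]). Neutral and atomic classes contain only one element, such
that there is only one possible ordering. Furthermore, let $<_{\mathcal{C}_n}$ denote the
ordering within the combinatorial class $\mathcal{C}_n$, then

♦ If $\mathcal{C} = \mathcal{A}_1 + \ldots + \mathcal{A}_k$ and $\gamma, \gamma' \in \mathcal{C}_n$,
then $\gamma <_{\mathcal{C}_n} \gamma'$ iff

$[\gamma \in (\mathcal{A}_i)_n$ and $\gamma' \in (\mathcal{A}_j)_n$ and $i < j]$ or
$[\gamma, \gamma' \in (\mathcal{A}_i)_n$ and $\gamma <_{(\mathcal{A}_i)_n} \gamma']$.

♦ If $\mathcal{C} = \mathcal{A} \times \mathcal{B}$ and $\gamma = (\alpha, \beta), \gamma' = (\alpha', \beta') \in \mathcal{C}_n$,
then $\gamma <_{\mathcal{C}_n} \gamma'$ iff

$[\mathrm{size}(\alpha) < \mathrm{size}(\alpha')]$ or
$[j = \mathrm{size}(\alpha) = \mathrm{size}(\alpha')$ and $\alpha <_{\mathcal{A}_j} \alpha']$ or
$[\alpha = \alpha'$ and $\beta <_{\mathcal{B}_{n-j}} \beta']$

when considering the lexicographic order $(1, 2, 3, \ldots, n)$, which is induced by the
specification $\mathcal{C}_n = \mathcal{A}_0 \times \mathcal{B}_n + \mathcal{A}_1 \times \mathcal{B}_{n-1} + \mathcal{A}_2 \times \mathcal{B}_{n-2} + \ldots + \mathcal{A}_n \times \mathcal{B}_0$.

♦ If $\mathcal{C} = \mathcal{A} \times \mathcal{B}$ and $\gamma = (\alpha, \beta), \gamma' = (\alpha', \beta') \in \mathcal{C}_n$,
then $\gamma <_{\mathcal{C}_n} \gamma'$ iff

$[\min(\mathrm{size}(\alpha), \mathrm{size}(\beta)) < \min(\mathrm{size}(\alpha'), \mathrm{size}(\beta'))]$ or
$[\min(\mathrm{size}(\alpha), \mathrm{size}(\beta)) = \min(\mathrm{size}(\alpha'), \mathrm{size}(\beta'))$ and $\mathrm{size}(\alpha) < \mathrm{size}(\alpha')]$ or
$[j = \mathrm{size}(\alpha) = \mathrm{size}(\alpha')$ and $\alpha <_{\mathcal{A}_j} \alpha']$ or
$[\alpha = \alpha'$ and $\beta <_{\mathcal{B}_{n-j}} \beta']$

when considering the boustrophedon order $(1, n, 2, n-1, \ldots, \lceil n/2 \rceil)$, induced by the
specification $\mathcal{C}_n = \mathcal{A}_0 \times \mathcal{B}_n + \mathcal{A}_n \times \mathcal{B}_0 + \mathcal{A}_1 \times \mathcal{B}_{n-1} + \mathcal{A}_{n-1} \times \mathcal{B}_1 + \ldots$

Considering $<_{\mathcal{C}_n}$, the actual unranking algorithms are quite straightforward.
Therefore, they will not be presented here and we refer to [20,44] for details.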
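
To make the two product orders concrete, the following sketch unranks a cartesian product by walking through the size splits in either order. It is not the algorithm of [20,44]; the function names, the callback-style interface and the toy class of binary strings are assumptions made for illustration.

```python
def lexicographic_splits(n):
    """Size splits (j, n - j) in the order j = 0, 1, 2, ..., n."""
    return [(j, n - j) for j in range(n + 1)]

def boustrophedon_splits(n):
    """Size splits in the order j = 0, n, 1, n - 1, 2, ... (from both ends)."""
    order, lo, hi = [], 0, n
    while lo <= hi:
        order.append((lo, n - lo))
        if hi != lo:
            order.append((hi, n - hi))
        lo, hi = lo + 1, hi - 1
    return order

def unrank_product(n, i, size_a, size_b, unrank_a, unrank_b, splits):
    """Unrank the i-th pair (0-based) of size n in A x B for a given split order."""
    for j, k in splits(n):
        block = size_a(j) * size_b(k)      # number of pairs with sizes (j, k)
        if i < block:
            # inside a block, order by the first component, then by the second
            return unrank_a(j, i // size_b(k)), unrank_b(k, i % size_b(k))
        i -= block
    raise IndexError("rank out of range")

# Toy class (hypothetical): binary strings of length n, ranked numerically.
def size_bits(n):
    return 2 ** n

def unrank_bits(n, i):
    return format(i, "b").zfill(n) if n else ""
```

Both orders enumerate the same set of pairs; they differ only in which split is tried first. Trying the extreme splits first is what keeps the search for the right block short when one component is small.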

Recall that in [20], the basic approach towards non-uniform random generation is 
weighting of combinatorial classes, as this makes it possible that the classes are non- 
uniformly distributed. If those combinatorial classes are to correspond to a considered 
SCFG, we have to face the problem that the maximum likelihood (ML) training intro- 
duces rational weights for the production rules while weighting as an admissible con- 
struction needs integer arguments. 

When translating rational probabilities into integral weights, we have to ensure that
the relative weight of each (unambiguously) generated word remains unchanged. This
can be achieved by scaling all productions by the same factor (a common denominator of
all probabilities), while ensuring that derivations are of equal length for words of the
same size (which is ensured by using grammars in CNF). However, a much more elegant way
is to scale each production according to its contribution to the length of the word
generated, that is, productions lengthening the word by k will be scaled by $c^k$. Since we
consider CFGs, the lengthening of a production of the form $A \to \alpha$ is given by
$|\alpha| - 1$. However, this rule leaves productions with a conclusion of length 1
unscaled, hence we have to ensure that all those productions already have integral
weights. Furthermore, ε-productions need a special treatment. We do not discuss full
details here and conclude by noting that the reweighting normal form (RNF) takes care of
all these issues:

Definition 0.7 ([20]). If $\mathcal{G} = (I, T, R, S, W)$ is a WCFG, $\mathcal{G}$ is said to be in
reweighting normal form (RNF) iff

1. $\mathcal{G}$ is loop-free and ε-free.

2. For all $A \to \alpha \in R$ with $A = S$, we have $|\alpha| \leq 1$.

3. For all $A \to \alpha \in R$ with $A \neq S$, we have $|\alpha| > 1$ or $W(A \to \alpha) \in \mathbb{N}$.

4. For all $A \in I$ there exists $\alpha \in (I \cup T)^*$ such that $A \to \alpha \in R$.

Note that the last condition (that any intermediate symbol occurs as premise of at 
least one production) is not required for reweighting, but necessary for the translation 
of a grammar into an admissible specification. 
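
The effect of scaling by $c^k$ can be checked on a small example. The sketch below uses a hypothetical weighted grammar with the three rules B → |, B → |B and B → (B) and made-up weights 1, 2/3 and 1/5; it is not the grammar of the paper. Because every derivation of a word of size n starts from a single symbol and ends with n symbols, the lengthenings add up to n - 1, so every word of size n is scaled by the same factor $c^{n-1}$ and relative weights are preserved.

```python
from fractions import Fraction
from math import lcm  # available from Python 3.9 on

# Hypothetical weighted rules for one non-terminal B: (right-hand side, weight).
RULES = [("|", Fraction(1)), ("|B", Fraction(2, 3)), ("(B)", Fraction(1, 5))]

# c is a common denominator of all weights.
c = lcm(*(w.denominator for _, w in RULES))

# A rule A -> alpha lengthens the word by |alpha| - 1 and is scaled by c^(|alpha| - 1).
reweighted = [(rhs, w * c ** (len(rhs) - 1)) for rhs, w in RULES]

def weighted_sizes(rules, n_max):
    """Total weight of all words of size n derivable from B, for n = 0..n_max."""
    size = [Fraction(0)] * (n_max + 1)
    for n in range(1, n_max + 1):
        for rhs, w in rules:
            terminals = len(rhs) - rhs.count("B")
            if "B" in rhs:
                if n - terminals >= 1:
                    size[n] += w * size[n - terminals]
            elif n == terminals:
                size[n] += w
    return size

original = weighted_sizes(RULES, 8)
scaled = weighted_sizes(reweighted, 8)
```

Here the rule of length 1 already has the integral weight 1, which is the situation condition 3 of the RNF guarantees.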

Definition 0.8 ([20]). A WCFG $\mathcal{G}$ is called loop-free iff there exists no nonempty
derivation $A \Rightarrow^+ A$ for $A \in I$. It is called ε-free iff there exists no
$(A, \varepsilon) \in R$ with $A \neq S$ and there exists no $(A, \alpha_1 S \alpha_2) \in R$,
where ε denotes the empty word.

If $\mathcal{G}$ and $\mathcal{G}'$ are WCFGs, then $\mathcal{G}$ and $\mathcal{G}'$ are said to be
word-equivalent iff $\mathcal{L}(\mathcal{G}) = \mathcal{L}(\mathcal{G}')$ and for each word
$w \in \mathcal{L}(\mathcal{G})$, we have $W(w) = W'(w)$.

In [20], it is shown how to transform an arbitrary WCFG to a word-equivalent, loop-free
and ε-free grammar, that grammar to one in RNF and the latter to the corresponding
admissible specification. Formally:

Theorem 0.1 ([39]). If $\mathcal{G}$ is an SCFG, there exists an SCFG $\mathcal{G}'$ in Chomsky
normal form (CNF) that is word-equivalent to $\mathcal{G}$, and $\mathcal{G}'$ can be effectively
constructed from $\mathcal{G}$.

The construction given in [39] assumes that $\mathcal{G}$ is ε-free. It can however be extended
to non-ε-free grammars by adding an additional step after the intermediate grammar
has been created (see e.g. [20]). Furthermore, it should be noted that an unambiguous
grammar is inevitably loop-free.

Theorem 0.2 ([20]). If $\mathcal{G}$ is a loop-free, ε-free WCFG, there exists a WCFG
$\mathcal{G}'$ in RNF that is word-equivalent to $\mathcal{G}$, and $\mathcal{G}'$ can be effectively
constructed from $\mathcal{G}$.

Altogether, starting with an arbitrary unambiguous SCFG $\mathcal{G}_0$ that models the class of
objects to be randomly generated, we have to proceed along the following lines:

♦ Transform $\mathcal{G}_0$ to a corresponding ε-free and loop-free SCFG $\mathcal{G}_1$.

♦ Transform $\mathcal{G}_1$ into $\mathcal{G}_2$ in RNF (where all production weights are rational).

♦ Reweight the production rules of $\mathcal{G}_2$ (such that all production weights are
integral), yielding the reweighted WCFG $\mathcal{G}_3$.

♦ Transform $\mathcal{G}_3$ (with integral weights) into the corresponding admissible
specification.

♦ This specification (with weighted classes) can be translated directly

- into a recursion for the function size of all involved combinatorial (sub)classes
(where class sizes are weighted) and

- into generating algorithms for the specified (weighted) classes,

yielding the desired weighted unranking algorithm for generating random elements
of $\mathcal{L}(\mathcal{G}_0)$.

A small example that shows how to proceed from an SCFG to reweighting normal form
and the corresponding weighted combinatorial classes, which allow for non-uniform
generation by means of unranking, is discussed in the Appendix.
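
The last step of this pipeline can be illustrated on a toy scale. The sketch below assumes a hypothetical reweighted grammar with rules B → | (weight 1), B → |B (weight 10) and B → (B) (weight 45), and it treats an integral weight w as w identical copies of the weighted class; this is one simple way to realise weighting and is not claimed to be Algorithm 6 of [20]. All names are illustrative.

```python
from functools import lru_cache

# Hypothetical integral rule weights of a reweighted toy grammar.
W_BAR, W_EXT, W_PAIR = 1, 10, 45   # B -> | ,  B -> |B ,  B -> (B)

@lru_cache(maxsize=None)
def size(n):
    """Weighted number of words of size n derivable from B."""
    if n < 1:
        return 0
    total = W_BAR if n == 1 else 0
    return total + W_EXT * size(n - 1) + W_PAIR * size(n - 2)

def unrank(n, i):
    """Return the word of size n selected by the weighted rank i (0-based)."""
    if not 0 <= i < size(n):
        raise IndexError("rank out of range")
    if n == 1:
        return "|"
    block = W_EXT * size(n - 1)        # ranks belonging to B -> |B
    if i < block:
        # the W_EXT copies of the subclass all yield the same object
        return "|" + unrank(n - 1, i % size(n - 1))
    i -= block                          # remaining ranks belong to B -> (B)
    return "(" + unrank(n - 2, i % size(n - 2)) + ")"
```

Drawing i with `random.randrange(size(n))` and calling `unrank(n, i)` then yields a word of size n with probability proportional to its weight, using integer arithmetic only. For n = 3, for example, the word `(|)` is selected by 45 of the 145 ranks.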

Generating Random RNA Secondary Structures 

We will now use the previously discussed approach to construct a weighted
unranking algorithm that generates random RNA secondary structures of a given size
according to a realistic probability distribution. In this paper, the corresponding
probability distribution will be induced by a set of sample (SSU and LSU r)RNA secondary
structures from the databases [45,46], which will be referred to as the biological
database in the sequel. However, the presented algorithm can easily be used for any
other distribution which can be defined by a database of known RNA structures of a
particular RNA type; our webservice implementation accessible at
http://wwwagak.cs.uni-kl.de/NonUniRandGen is actually able to sample random secondary
structures of any specified RNA type. A link to download an implementation of our
algorithm (in Wolfram Mathematica) can be found there, too.

Considered Combinatorial Class 

According to the common definition of RNA secondary structure, we decided to con- 
sider the combinatorial class of all RNA secondary structures without pseudoknots 
that meet the stereochemical constraint of hairpin loops consisting of at least 3 
unpaired nucleotides, formally: 

Definition 0.9 ([21]). The language $\mathcal{L}$ containing exactly all RNA secondary
structures is given by (note that according to this definition, completely unpaired
structures are prohibited) $\mathcal{L} := \mathcal{L}_u (\mathcal{L}_{l,u})^+$, where
$\mathcal{L}_{l,u} := (\mathcal{L}_l)\mathcal{L}_u$, $\mathcal{L}_u := \{|\}^*$ is the language of all
bar-bracket representations of single-stranded regions and $\mathcal{L}_l$ is the language of
all bar-bracket representations of other possible substructures, i.e. is the smallest
language satisfying the following conditions:

1. $\{|\}^+ \setminus \{|, ||\} \subset \mathcal{L}_l$ (bar-bracket representations of hairpin loops).

2. If $w \in \mathcal{L}_l$, then $(w) \in \mathcal{L}_l$ (bar-bracket representation of a stacked pair).

3. If $w \in \mathcal{L}_l$, then $\{|\}^+(w) \subset \mathcal{L}_l$ and $(w)\{|\}^+ \subset \mathcal{L}_l$
(bar-bracket representations of bulge loops).

4. If $w \in \mathcal{L}_l$, then $\{|\}^+(w)\{|\}^+ \subset \mathcal{L}_l$ (bar-bracket representations
of interior loops).

5. If $w_1, \ldots, w_n \in \mathcal{L}_l$ and $n \geq 2$, then
$\mathcal{L}_u(w_1)\mathcal{L}_u(w_2) \cdots \mathcal{L}_u(w_n)\mathcal{L}_u \subset \mathcal{L}_l$
(bar-bracket representations of multibranched loops).

The desired weighted unranking algorithm thus generates, for a given size n and a
given number $i \in \{0, \ldots, \mathrm{card}(\mathcal{L}_n) - 1\}$, the i-th secondary structure
$s \in \mathcal{L}_n$, where $\mathrm{card}(\mathcal{L}_n) = \mathrm{size}(\mathcal{L}, n)$ is the number
of elements in the weighted class $\mathcal{L}_n$.
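
Read informally, the definition says that a bar-bracket word is a secondary structure iff its brackets are balanced, it contains at least one pair, and every hairpin loop has at least 3 unpaired positions. The following sketch tests exactly these three conditions and counts structures of small size by brute force; the function names are illustrative and the unweighted count is only meant as a sanity check of the definition.

```python
import re
from itertools import product

def is_secondary_structure(word):
    """Check membership of a bar-bracket word in the language of the definition."""
    if set(word) - set("()|"):
        return False
    depth = 0
    for ch in word:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:            # closing bracket without partner
                return False
    if depth != 0 or "(" not in word:
        return False                 # unbalanced or completely unpaired
    # a hairpin loop with fewer than 3 unpaired positions is forbidden
    return re.search(r"\(\|{0,2}\)", word) is None

def count_structures(n):
    """Number of secondary structures of size n (brute force, small n only)."""
    return sum(is_secondary_structure("".join(t)) for t in product("()|", repeat=n))
```

The smallest structure is `(|||)` of size 5; there are three structures of size 6, namely `(||||)`, `|(|||)` and `(|||)|`.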

Considered SCFG Model 

First, we have to find a suitable SCFG that generates $\mathcal{L}$ and models the distribution
of the sample data as closely as possible. To reach this goal, it is important to
specify the set of production rules appropriately in order to guarantee that all
substructures that have to be distinguished are generated by different rules. This is due
to the fact that by using only one production rule f to generate different substructures
(e.g. any unpaired nucleotides independent of the type of loop they belong to), there is
only one weight (the probability $p_f$ of this production f) with which any of these
substructures is generated, whereas the use of different rules $f_1, \ldots, f_k$ to distinguish
between these substructures implies that they may be generated with different
probabilities $p_{f_1}, \ldots, p_{f_k}$, where $p_{f_1} + \cdots + p_{f_k} = p_f$. This way, we ensure
that more common substructures are generated with higher probabilities than less common
ones.

Example 0.3. A (rather simple) unambiguous SCFG $\mathcal{G}_s$ generating the language
$\mathcal{L}$ is given by:

$w_1 : S \to CA$,

$w_2 : A \to (B)C$, $w_3 : A \to (B)CA$,
$w_4 : B \to |||C$, $w_5 : B \to CA$,
$w_6 : C \to \varepsilon$, $w_7 : C \to |C$.

This grammar unambiguously generates $\mathcal{L}$ for the following reasons:

♦ Every sentential form C(B)C(B) ... (B)C obviously is generated in a unique way;
this resembles $\mathcal{L} = \mathcal{L}_u (\mathcal{L}_{l,u})^+$ and
$\mathcal{L}_{l,u} := (\mathcal{L}_l)\mathcal{L}_u$ of $\mathcal{L}$'s definition. The number of
outermost pairs of brackets in the entire string uniquely determines the corresponding
sentential form to be used.

♦ Now B either generates a hairpin loop $|^{\geq 3}$, which unambiguously is possible by
rules $B \to |||C$, $C \to |C$ and $C \to \varepsilon$, or

♦ B itself has to generate at least one additional pair of brackets. In this case,
$B \to CA$ must be applied (only A can generate brackets) and then $A \to (B)C$ resp.
$A \to (B)CA$ are used; the number of outermost brackets to be generated (from B under
consideration) again uniquely determines that part of the derivation.

When changing the production $w_5 : B \to CA$, used to generate any possible k-loop for
$k \geq 2$ (any loop that is not a hairpin loop) with probability $w_5$, into the two rules

$w_{5.1} : B \to C(B)C$, $w_{5.2} : B \to C(B)CA$,

where $w_{5.1} + w_{5.2} = w_5$, it becomes possible to generate any possible 2-loop (i.e. a
stacked pair, a bulge (on the left or on the right), or an interior loop) and all kinds of
multiloops (i.e. any k-loop with $k \geq 3$) with different probabilities, which could
increase the accuracy of the SCFG model. By additionally replacing the first of these two
new rules, $w_{5.1} : B \to C(B)C$, by the four productions

$w_{5.1.1} : B \to (B)$, $w_{5.1.2} : B \to |C(B)$, $w_{5.1.3} : B \to (B)C|$, $w_{5.1.4} : B \to |C(B)C|$,

where $(w_{5.1.1} + \ldots + w_{5.1.4}) + w_{5.2} = w_{5.1} + w_{5.2} = w_5$, we can distinguish
between the different types of 2-loops more accurately, yielding a more realistic
secondary structure model. In fact, in the case of significant differences between the new
probabilities ($w_{5.1.1}, \ldots, w_{5.1.4}$ and $w_{5.2}$), we can expect a huge improvement in
the model's accuracy. Note that it is not hard to see that changes to a grammar like the
ones just discussed do not change the language generated. However, this is not at all
obvious with respect to ambiguity of the grammar.

According to the previously mentioned facts (and the corresponding illustrations by
Example 0.3), we decided that the basis for our weighted unranking algorithm should
be the following ε-free, loop-free and unambiguous SCFG (note that these are exactly the
conditions required beforehand for the basis SCFG according to [20]), which
has been derived from the sophisticated SCFG presented in [21] that distinguishes
between all known structural motifs that can be found in RNA secondary structure:
Definition 0.10. The unambiguous ε-free SCFG $\mathcal{G}_{sto}$ generating exactly the language
$\mathcal{L}$ is given by $\mathcal{G}_{sto} = (I_{\mathcal{G}_{sto}}, \Sigma_{\mathcal{G}_{sto}}, R_{\mathcal{G}_{sto}}, S')$, where

$I_{\mathcal{G}_{sto}} = \{S', E, S, T, C, A, L, G, D, B, F, H, P, Q, R, V, W, O, J, K, M, X, Y, Z, N, U\}$,
$\Sigma_{\mathcal{G}_{sto}} = \{(, ), |\}$ and $R_{\mathcal{G}_{sto}}$ contains exactly the following rules:

p1  : S' -> E,
p2  : E -> S,     p3  : E -> SC,                                         shape of exterior loop
p4  : S -> A,     p5  : S -> TA,
p6  : T -> E,     p7  : T -> C,
p8  : C -> |,     p9  : C -> C|,                                         strands in exterior loop
p10 : A -> (L),                                                          initiate helix
p11 : L -> A,     p12 : L -> M,                                          initiate stacked pair or multiple loop
p13 : L -> P,     p14 : L -> Q,     p15 : L -> R,                        initiate interior loop
p16 : L -> F,     p17 : L -> G,                                          initiate hairpin loop or bulge loop
p18 : G -> A|,    p19 : G -> AD,    p20 : G -> |A,    p21 : G -> DA,     shape of bulge loop
p22 : D -> …,
p23 : B -> |,     p24 : B -> B|,                                         strands in bulge loop
p25 : F -> |||,   p26 : F -> …,     p27 : F -> …,                        hairpin loop
p28 : H -> |,     p29 : H -> H|,
p30 : P -> |A|,   p31 : P -> |A||,  p32 : P -> ||A|,  p33 : P -> ||A||,  small interior loops
p34 : Q -> …,     p35 : Q -> …,                                          other interior loops
p36 : R -> …,     p37 : R -> W|,
p38 : V -> JO,
p39 : W -> JA,
p40 : O -> …,
p41 : J -> |,     p42 : J -> J|,                                         strands in interior loop
p43 : K -> |,     p44 : K -> K|,
p45 : M -> XY,
p46 : X -> A,     p47 : X -> UA,
p48 : Y -> Z,                                                            multiple loop
p49 : Z -> X,     p50 : Z -> XN,
p51 : N -> Z,     p52 : N -> U,
p53 : U -> |,     p54 : U -> …                                           strands in multiple loop



Figures 2 and 3 illustrate by examples how (parts of) secondary structures are generated
by this SCFG, where a shorthand symbol denotes the full parse tree for $I \Rightarrow^* x$
(i.e. for consecutive applications of an arbitrary number of production rules that
generate the subword x from the intermediate symbol I) in order to obtain a more compact
tree representation. In fact, it is easy to see that the overall structure is always
produced by starting with the axiom S', while any particular substructure or structural
motif that belongs to the combinatorial (sub)class $\mathcal{I}$ is created from the
corresponding intermediate symbol I.

For our application it is crucial that $\mathcal{G}_{sto}$, as claimed in its definition, is
unambiguous. To prove this, we first note that $\mathcal{G}_{sto}$ has been constructed starting
from a simple grammar which generates $\mathcal{L}$ by iteratively replacing one production by
several ones (like we did in the previous example) in order to distinguish more and more
structural motifs but without changing the language generated. Furthermore, a standard
construction to make the grammar ε-free has been applied. That way, we can be sure that
$\mathcal{G}_{sto}$ generates $\mathcal{L}$ (formally this fact easily follows by obvious
bi-simulation proofs for each substitution and by the proven correctness of the
construction used to ensure ε-freeness). To prove unambiguity, we translate
$\mathcal{G}_{sto}$ into a system of equations for its structure generating function (see [47]
for details) $S(z) = \sum_{w \in \mathcal{L}} d(w) z^{|w|}$, where d(w) denotes the number of
derivation trees $\mathcal{G}_{sto}$ offers for w. Eliminating all but the variable associated
with the axiom and simplifying (for this step we made use of Mathematica) yields the
single equation

$$-z^5 + S(z)(-1+z)\bigl(-1 + z(2 - S(z)(-1+z)z + z^4)\bigr) = 0.$$

This equation is exactly the one of our grammar $\mathcal{G}_s$ from Example 0.3, which proves
that for all n both grammars have the same number of derivation trees for words of
size n. Knowing that both grammars generate $\mathcal{L}$ and that $\mathcal{G}_s$ is unambiguous,
the same can be concluded for $\mathcal{G}_{sto}$.

Nebel et al. Algorithms for Molecular Biology 2011, 6:24
http://www.almob.org/content/6/1/24

Figure 2 Unique parse tree for the bar-bracket word considered in Example 0.1 that corresponds
to the planar secondary structure from Figure 1.
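
The claim that this equation belongs to the simple grammar of Example 0.3 can be checked numerically. The sketch below counts derivation trees of that grammar by word size, using the rules S → CA, A → (B)C | (B)CA, B → |||C | CA, C → ε | |C, and then verifies that the resulting truncated power series satisfies the equation up to order N. The truncation order N = 20 and all function names are arbitrary choices for this illustration.

```python
N = 20  # truncation order of all power series (arbitrary choice, must be >= 5)

def mul(a, b):
    """Product of two power series, truncated after z^N."""
    out = [0] * (N + 1)
    for i, x in enumerate(a):
        if x:
            for j, y in enumerate(b[: N + 1 - i]):
                out[i + j] += x * y
    return out

def add(*series):
    return [sum(cs) for cs in zip(*series)]

def mono(k, coeff=1):
    """The monomial coeff * z^k as a truncated series."""
    p = [0] * (N + 1)
    p[k] = coeff
    return p

def tree_counts():
    """Number of derivation trees from S for each word size 0..N."""
    C = [1] * (N + 1)          # C -> eps | |C : exactly one tree per size
    A, B, S = ([0] * (N + 1) for _ in range(3))
    for n in range(N + 1):
        if n >= 2:             # A -> (B)C | (B)CA ; the brackets use 2 positions
            m = n - 2
            BC = [sum(B[j] * C[t - j] for j in range(t + 1)) for t in range(m + 1)]
            A[n] = BC[m] + sum(BC[t] * A[m - t] for t in range(m + 1))
        S[n] = sum(C[i] * A[n - i] for i in range(n + 1))   # S -> CA
        B[n] = S[n] + (C[n - 3] if n >= 3 else 0)           # B -> CA | |||C
    return S

def residual(S):
    """Left-hand side -z^5 + S(-1+z)(-1 + z(2 - S(-1+z)z + z^4)), truncated."""
    s_zm1 = mul(S, add(mono(0, -1), mono(1)))                       # S(z)(-1+z)
    inner = add(mono(0, 2), [-c for c in mul(s_zm1, mono(1))], mono(4))
    bracket = add(mono(0, -1), mul(mono(1), inner))
    return add(mono(5, -1), mul(s_zm1, bracket))

S = tree_counts()
```

The first nonzero coefficients are 1 for size 5 and 3 for size 6, and every coefficient of `residual(S)` vanishes.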

Note that $\mathcal{G}_{sto}$ contains more production rules (and more different non-terminal
symbols) than the SCFG considered in [21], but this new grammar is ε-free and,
additionally, the right-hand side of every single production contains at most two
non-terminal symbols, such that the resulting unranking algorithm has to consider fewer
cases (i.e. fewer "else if ( )" cases). For details, see [20] and the Appendix.

Furthermore, it should be mentioned that we decided to assign relative frequencies
to the production rules of $\mathcal{G}_{sto}$, since such probabilities can be computed
efficiently for unambiguous SCFGs. Moreover, by estimating the probabilities $p_i$,
$1 \leq i \leq 54$, by their relative frequencies, the resulting grammar $\mathcal{G}_{sto}$ has the
consistency property, which means $\mathcal{G}_{sto}$ provides a probability distribution on the
language $\mathcal{L}(\mathcal{G}_{sto}) = \mathcal{L}$. In particular, it is well-known that
relative frequencies in our context yield a maximum likelihood (ML) estimator for the
rule probabilities and thus a consistent estimator for the parameter set. We have
trained the probabilities (relative frequencies) of $\mathcal{G}_{sto}$ from the structures
$s \in \mathcal{L}(\mathcal{G}_{sto})$ given in our biological database. The resulting
probabilities are given in Table 1, their floating point approximations, rounded to the
third decimal place, in Table 2.
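
Estimating rule probabilities by relative frequencies amounts to counting how often each rule is used in the (unique) derivations of the training structures and normalising per left-hand side. The sketch below shows this with exact fractions; the function name and the rule counts are assumptions for illustration (the counts are chosen so that the resulting fractions agree with the values listed for E and T in Table 1, assuming those fractions are unreduced counts).

```python
from collections import defaultdict
from fractions import Fraction

def relative_frequencies(rule_counts):
    """Map each rule (lhs, rhs) to count(rule) / count(all rules with the same lhs)."""
    totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        totals[lhs] += count
    return {rule: Fraction(count, totals[rule[0]]) for rule, count in rule_counts.items()}

# Hypothetical usage counts of four rules in the derivations of a training set.
counts = {
    ("E", "S"): 137,
    ("E", "SC"): 6339,
    ("T", "E"): 11086,
    ("T", "C"): 1689,
}
probabilities = relative_frequencies(counts)
```

By construction, the probabilities of all rules with the same left-hand side sum to 1, and no floating point arithmetic is involved until the values are rounded for display.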

In order to see if over-fitting is an issue for our sophisticated grammar and its rich
parameter set, i.e. to see if our training set is large enough to derive reliable values for
the rule probabilities, we performed the following experiments: We selected a random
90% (resp. 50%) portion of the original training set and re-estimated the probabilities
of all the grammar rules. This process was iterated 40 times, resulting in a sample of
40 parameter sets. Finally, for each parameter we determined its variance along this
sample of size 40. The corresponding values lay between 0 (resulting for intermediate
symbols without alternatives, for whose productions a probability of 1 is predetermined)
and $2.87652 \times 10^{-6}$ (resp. $2.86242 \times 10^{-5}$). We can conclude that over-fitting is
no issue in connection with our sophisticated grammar and the training set used.

Table 1 The probabilities (relative frequencies) for the production rules of the SCFG
$\mathcal{G}_{sto}$ obtained by training it using our biological database

Nonterminal Nt   Probabilities of Rules with Premise Nt

S'               p1 := 1
E                p2 := 137/6476, p3 := 6339/6476
S                p4 := 177/12952, p5 := 12775/12952
T                p6 := 11086/12775, p7 := 1689/12775
C                p8 := 14367/148978, p9 := 134611/148978
A                p10 := 1
L                p11 := 605069/792975, p12 := 31912/792975, p13 := 4912/264325, p14 := 5821/158595,
                 p15 := 1893/264325, p16 := 2723/31719, p17 := 38399/792975
G                p18 := 11667/38399, p19 := 7235/38399, p20 := 11831/38399, p21 := 7666/38399
D                p22 := 1
B                p23 := 4967/12748, p24 := 7781/12748
F                p25 := 3912/68075, p26 := 23208/68075, p27 := 8191/13615
H                p28 := 8191/40700, p29 := 32509/40700
P                p30 := 533/4912, p31 := 1053/4912, p32 := 2963/14736, p33 := 7015/14736
Q                p34 := 4986/29105, p35 := 24119/29105
R                p36 := 2357/5679, p37 := 3322/5679
V                p38 := 1
W                p39 := 1
O                p40 := 1
J                p41 := 27441/84620, p42 := 57179/84620
K                p43 := 15731/53725, p44 := 37994/53725
M                p45 := 1
X                p46 := 6196/87035, p47 := 80839/87035
Y                p48 := 1
Z                p49 := 2812/55123, p50 := 52311/55123
N                p51 := 7737/17437, p52 := 9700/17437
U                p53 := 109939/518817, p54 := 408878/518817

Table 2 Floating point approximations of the probabilities (relative frequencies) for the
production rules of the SCFG $\mathcal{G}_{sto}$ (rounded to three decimal places)

Nonterminal Nt   Probabilities of Rules with Premise Nt

S'               p1 := 1.000
E                p2 := 0.021, p3 := 0.979
S                p4 := 0.014, p5 := 0.986
T                p6 := 0.868, p7 := 0.132
C                p8 := 0.096, p9 := 0.904
A                p10 := 1.000
L                p11 := 0.763, p12 := 0.040, p13 := 0.019, p14 := 0.037,
                 p15 := 0.007, p16 := 0.086, p17 := 0.048
G                p18 := 0.304, p19 := 0.188, p20 := 0.308, p21 := 0.200
D                p22 := 1.000
B                p23 := 0.390, p24 := 0.610
F                p25 := 0.057, p26 := 0.341, p27 := 0.602
H                p28 := 0.201, p29 := 0.799
P                p30 := 0.109, p31 := 0.214, p32 := 0.201, p33 := 0.476
Q                p34 := 0.171, p35 := 0.829
R                p36 := 0.415, p37 := 0.585
V                p38 := 1.000
W                p39 := 1.000
O                p40 := 1.000
J                p41 := 0.324, p42 := 0.676
K                p43 := 0.293, p44 := 0.707
M                p45 := 1.000
X                p46 := 0.071, p47 := 0.929
Y                p48 := 1.000
Z                p49 := 0.051, p50 := 0.949
N                p51 := 0.444, p52 := 0.556
U                p53 := 0.212, p54 := 0.788

Derivation of the Algorithm 

The elaborate SCFG $\mathcal{G}_{sto}$ is appropriate for being used as the basis for the desired
weighted unranking method: after having determined the RNF of this SCFG and the
corresponding weighted combinatorial classes, we easily find a recursion for the size
function (in the same way as discussed in Example App-.4). Then, we can use the
resulting weighted class sizes for the straightforward construction of the desired
unranking algorithm.

In fact, for the construction of the complete algorithm, we simply have to use
Algorithms 1 to 4 (Unranking of neutral classes, atomic classes, disjoint unions and
cartesian products, respectively) and Algorithm 6 (Unranking of weighted classes) given in
[20] as subroutines. However, to improve the worst-case complexity of the resulting
unranking procedure from $O(n^2)$ to $O(n \cdot \log(n))$ by using the boustrophedonic order
instead of the sequential order, a simple change in Algorithm 4 (Unranking of cartesian
products) is necessary (see e.g. [7]).

A random RNA secondary structure of size n can easily be computed by drawing a
random number $i \in \{0, \ldots, \mathrm{size}(\mathcal{L}, n) - 1\}$ and then unranking the i-th
structure of size n. The worst-case runtime complexity of this procedure is equal to that
of unranking and is thus given by $O(n \cdot \log(n))$ when using the boustrophedonic order.
By repeating this procedure m times, a set of m (not necessarily distinct) random RNA
secondary structures of size n can be generated in time $O(m \cdot n \cdot \log(n))$, where a
preprocessing time of $O(n^2)$ is required for the computation of all (weighted) class
sizes up to input length n.

A complete and detailed description of the derivation of our weighted unranking 
algorithm for (SSU and LSU r)RNA secondary structures can be found in the Appen- 
dix, since it is too comprehensive to be presented here and the different steps for its 
generation correspond to those described in [20]. 

Availability of Software 

It may be of interest to the reader that this non-uniform random generation algorithm 
for RNA secondary structures has been implemented as a webservice which is accessi- 
ble to the scientific community under http://wwwagak.cs.uni-kl.de/NonUniRandGen. 
Since it is relevant for researchers to have methods available for generating random 
structures that are realistic for a particular investigation, this webservice is also capable 
of allowing the user to specify the distribution from which the corresponding struc- 
tures should be sampled (in the form of a set of secondary structure samples from 
which the parameters for our grammars are inferred). Furthermore, our Mathematica 
source code used to implement the webservice can be downloaded from our website 
and used under GNU public licence. 

Discussion 

The purpose of this section is to analyze the quality of randomly generated structures 
by considering some experimental results. 

Parameters for Structural Motifs 

As a first step, we decided to consider several important parameters related to particu- 
lar structural motifs of RNA secondary structure and compare the observed statistical 
values derived from a native sample (here our biological database, i.e. the set of real- 
life RNA data that we used for deriving the distribution and thus the weights for the 
unranking algorithm) to those derived from a corresponding random sample (i.e. a set 
of random structures generated by our algorithm). In order to obtain an appropriate 
random sample, we have generated exactly one random structure of size n for each 
native RNA structure of size n given in our database, such that for each occurring size 
n, the random sample and the native sample contain the same number of structures 
having this size. 

The results are presented in Table 3. Comparing the specific values of the different
parameters, we can infer that our algorithm produces random RNA secondary structures
that are, with respect to the different structural motifs and thus with respect to the

Table 3 Expectation and variance of important parameters related to particular
structural motifs of RNA secondary structure

Parameter    Expected Value            Variance
             Random      Native        Random      Native
num_unp      848.179     839.956       98964.7     103426.
num_bps      420.848     424.96        27785.3     31310.9
num_urs      179.73      181.822       4959.96     5117.47
num_e        1.          1.            0.          0.
num_h        36.6983     36.4818       196.935     185.596
num_s        321.18      324.26        16538.8     19343.4
num_b        20.6061     20.5782       87.1894     50.3103
num_i        26.1442     26.538        125.66      194.769
num_m        16.2197     17.1018       57.8874     41.0261
num_hel      99.6683     100.7         1549.24     1492.84
unp_e        106.014     79.8382       4039.69     3897.61
unp_h        6.93534     6.93188       18.4264     77.464
unp_s                                  -           -
unp_b        1.9948      1.99596       3.10283     6.87868
unp_i        7.14617     7.08869       16.5725     31.1197
unp_m        16.0122     16.2577       87.4906     195.497
unp_hel
bps_e        9.41479     6.94105       29.1956     6.30949
bps_h
bps_s        1.          1.            0.          0.
bps_b        1.          1.            0.          0.
bps_i        1.          1.            0.          0.
bps_m        2.68212     2.72734       1.12921     1.21643
bps_hel      4.22249     4.22006       13.6266     5.52299

Values are derived from a native sample (our biological database) and from a random
sample, respectively. num_x denotes the number of occurrences of motif x in one
secondary structure and unp_x (bps_x) denotes the number of accessible unpaired bases
(base pairs) in one substructure of type x. unp, bps, urs denote unpaired bases, base
pairs and unpaired regions, whereas e, h, s, b, i, m, hel denote exterior loop, hairpin
loop, stacked pair, bulge loop, interior loop, multiloop and helix, respectively.




expected shape of such structures, in most cases realistic. This is a major
improvement over existing approaches for the random generation of secondary structures
of a given input size n (where the corresponding specific RNA sequence is not known,
but only its length n), as those (sequence-independent) methods are only capable of
generating structures uniformly at random for input size n. Furthermore, with the SCFG
model used here, we have a new model for RNA secondary structures at hand which
realistically reflects the structure of an RNA molecule and its basic structural motifs.
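
To make the comparison of Table 3 concrete, the following minimal sketch shows how
the expectation and variance of a parameter could be estimated from a sample of
structures written over the alphabet {(, ), |}. It is an illustration under stated
assumptions, not the evaluation code used for the article: the function names are
hypothetical, only three of the simplest parameters are covered (hairpin loops are
recognised here as a pair directly enclosing unpaired bases only), and the use of the
population variance is an assumption.

```python
import re
import statistics


def motif_counts(structure):
    """Count a few simple parameters of one structure over the alphabet (, ), |."""
    return {
        "num_unp": structure.count("|"),  # unpaired bases
        "num_bps": structure.count("("),  # base pairs (one '(' per pair)
        # hairpin loops: a base pair that encloses unpaired bases only
        "num_h": len(re.findall(r"\(\|+\)", structure)),
    }


def summarize(sample, key):
    """Return (mean, population variance) of one parameter over a sample."""
    values = [motif_counts(s)[key] for s in sample]
    return statistics.mean(values), statistics.pvariance(values)
```

Applying `summarize` once to the native sample and once to a size-matched random
sample yields one row of such a table.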

Related Free Energies 

For further investigation of the accuracy of our random generator, we take a
completely different point of view and consider thermodynamics. The reasoning behind
this idea is that if an RNA secondary structure model induced by an SCFG shows a
realistic behaviour (expectation and variance) with respect to minimum free energy,
then it is rather likely that our grammar also gives a realistic picture of all the
different structural motifs of a molecule's folding (as the free energy of a molecule's
structure is defined as the sum of the energy contributions of all its substructures).

Since we do not know the corresponding RNA sequences for the randomly generated
structures, we cannot use one of the common sequence-dependent thermodynamic models
for RNAs. Therefore, we decided to consider both the static and the dynamic free energy
model defined in [21] for RNA secondary structures with unknown sequence. In the static
model, averaged free energy contributions for the distinguished structural motifs are
considered, which can easily be derived from the training data (by sequence counting).
These averaged values represent the free energy contributions that have to be added for
the respective whole substructures. In the dynamic model, corresponding average values
for length-dependent free energy contributions (that depend on the number of unpaired
or paired bases within particular substructures) are added for each component (unpaired
base or base pair) in the respective motifs, such that, in contrast to the static model,
substructures of different lengths are assigned different free energy values. These
models are based on the well-known Turner energy model [22,23] and model parameters
have been derived from the same biological database (of SSU and LSU rRNAs) that we
consider in this article. In fact, both models have turned out to show a realistic
behaviour and can therefore be used to judge the quality of random structures generated
by our algorithm.

Unquantified Results

Similar to [21], we denote the free energy of a given secondary structure
$s \in \mathcal{L}$ according to the static and dynamic model by $g_{stat}(s)$ and
$g_{dyn}(s)$, respectively. Moreover, the expected free energy and corresponding
variance that have been analytically derived in that paper for any n > 0 are denoted by
$\mu_{energy,n} := E[energy(s) \mid size(s) = n]$ and
$\sigma^2_{energy,n} := V[energy(s) \mid size(s) = n]$, respectively, where
$energy \in \{g_{stat}, g_{dyn}\}$. The corresponding confidence interval for n > 0 and
k > 1, which contains at least $\left(100 - \frac{100}{k^2}\right)$ percent of the
energies in $\{energy(s) \mid s \in \mathcal{L}_n\}$, is denoted by

\[
I_{energy,n}(k) := [\mu_{energy,n} - k\sigma_{energy,n},\ \mu_{energy,n} + k\sigma_{energy,n}].
\]

As these analytical energy results from [21] and our unranking algorithm have been
derived from the same database of real-life RNA data and by modeling the same class
$\mathcal{L}$ of structures via very similar SCFGs, it seems adequate to use them for
comparisons with the energies of our randomly generated structures.

Before we start with our comparisons, note that for any sample set S of secondary
structures, we can calculate the corresponding energy points

\[
EP(S, energy) := \{(size(s), energy(s)) \mid s \in S\},
\]

where $energy \in \{g_{stat}, g_{dyn}\}$. Obviously, we can also compute the
corresponding "average energy points"

\[
AvEP(S, energy) := \left\{ \left(n,\ \mu_n := \frac{1}{card(S_n)} \sum_{s \in S_n} energy(s)\right) \,\middle|\, S_n \neq \emptyset \right\}
\]

and the corresponding "energy variance points"

\[
VarEP(S, energy) := \left\{ \left(n,\ \frac{1}{card(S_n)} \sum_{s \in S_n} (\mu_n - energy(s))^2\right) \,\middle|\, S_n \neq \emptyset \right\},
\]

respectively, where $S_n$ denotes the set of structures of size n in S. In the sequel,
we will denote a random sample generated by our algorithm by $\mathcal{R}$ and a native
sample (biological database) by $\mathcal{N}$.

In order to obtain an appropriate random sample for our energy comparisons, we
derived a large set of random structures by generating 1000 RNA secondary structures
for each of the sizes $n \in \{500, 1000, 1500, \ldots, 5000, 5500\}$ with our weighted
unranking algorithm. To compare the energies of our randomly generated structures to
the corresponding confidence interval(s), we decided to consider any
$k \in \{\sqrt{2}, 2, \sqrt{10}, \sqrt{20}\}$, meaning the probability that the free
energy of a random RNA secondary structure of size n lies within the corresponding
interval is greater than 0.5, 0.75, 0.9, and 0.95, respectively.
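
The relation between k and the guaranteed coverage $1 - 1/k^2$ (the bound known from
Chebyshev's inequality) can be checked numerically. The following is a small sketch
with hypothetical function names; the example values for the mean and variance are
placeholders and are not taken from [21].

```python
import math


def coverage_lower_bound(k):
    """Minimum fraction of energies inside [mu - k*sigma, mu + k*sigma]."""
    return 1.0 - 1.0 / (k * k)


def k_for_coverage(p):
    """Smallest k whose interval is guaranteed to contain a fraction p (0 <= p < 1)."""
    return 1.0 / math.sqrt(1.0 - p)


def confidence_interval(mu, variance, k):
    """Interval I(k) = [mu - k*sigma, mu + k*sigma] for a given mean and variance."""
    sigma = math.sqrt(variance)
    return (mu - k * sigma, mu + k * sigma)


# Coverage for the four values of k used in the comparisons.
for k in (math.sqrt(2), 2.0, math.sqrt(10), math.sqrt(20)):
    print(round(k, 5), round(coverage_lower_bound(k), 2))

# Placeholder mean and variance, for illustration only.
print(confidence_interval(-100.0, 25.0, 2.0))
```

The inverse function `k_for_coverage` reproduces the values of k that correspond to a
desired percentage, for example 1% or 25%.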

Figure 4 shows a plot of the corresponding four confidence intervals (analytically
derived, related to our biological data) along with the energy points for our random
sample and for our native database, respectively, under the assumption of the static
energy model. The corresponding plots for the dynamic energy model are shown in
Figure 5. Looking at both figures, we immediately see that the energies for our set of
randomly generated RNA secondary structures seem to fit those for the considered RNA
database and also the corresponding analytically obtained energy results from [21].
This observation becomes even clearer by considering Figures 6 and 7. There, we compare
the previously introduced "average energy points" and "energy variance points" to the
analytically determined expected free energy and corresponding variance from [21],
respectively.

Quantified Results

The previously considered energy comparisons have been presented only by unquantified
plots. This may not be very satisfying, since it is obvious that the free energy
decreases with structure size and, aside from this, it could have been expected that
for large randomly generated sets of structures of a given size, the average energy and
corresponding variance fit the analytically obtained energy results derived under the
assumption of a basically equivalent SCFG model for secondary structures. Therefore,
there is a need to consider some sort of quantification and additionally present
corresponding quantified comparison results. What really matters is the degree to
which the energy ranges of the random structures agree, in distribution, with our
biological database. This means we have to find out if the energies related to a random
sample (generated by our unranking method) and those related to a native sample
(given by the structures in our biological database) come from a common distribution.




Figure 4 Plots of the confidence intervals $I_{g_{stat},n}(k)$. Intervals are shown for
the static energy model (blue), for $k \in \{\sqrt{2}, 2, \sqrt{10}, \sqrt{20}\}$ (top
left to bottom right), together with the corresponding energy points
$EP(\mathcal{R}, g_{stat})$ for the random sample (cyan) and $EP(\mathcal{N}, g_{stat})$
for the native sample (green).

Figure 6 Plot of expectations of the free energy. Plots show $\mu_{g_{stat},n}$ (blue)
and $\mu_{g_{dyn},n}$ (purple) of a random RNA secondary structure of size n, together
with the "average energy points" $AvEP(\mathcal{R}, g_{stat})$ (cyan) and
$AvEP(\mathcal{R}, g_{dyn})$ (magenta) for the random sample.



Consequently, we have to consider the energies of a random sample and those of a
native one as two independent sets of values and determine the extent to which their
distributions coincide, or in other words, test for significant differences between
these two sets. For this reason, we decided to apply one of the most common
(non-parametric) significance tests known from statistics, the so-called Mann-Whitney
U-test [48], which is widely used as a statistical hypothesis test for assessing
whether two independent samples of observations (with arbitrary sample sizes) come
from the same distribution. It is also known as the Wilcoxon rank-sum test [49], which
however can only be applied for equal sample sizes.




Figure 5 Plots of the confidence intervals $I_{g_{dyn},n}(k)$. Intervals are shown for
the dynamic energy model (purple), for $k \in \{\sqrt{2}, 2, \sqrt{10}, \sqrt{20}\}$
(top left to bottom right), together with the corresponding energy points
$EP(\mathcal{R}, g_{dyn})$ for the random sample (magenta) and
$EP(\mathcal{N}, g_{dyn})$ for the native sample (yellow).

Figure 7 Plot of variances of the free energy. Plots show $\sigma^2_{g_{stat},n}$
(blue) and $\sigma^2_{g_{dyn},n}$ (purple) of a random RNA secondary structure of size
n, together with the "energy variance points" $VarEP(\mathcal{R}, g_{stat})$ (cyan) and
$VarEP(\mathcal{R}, g_{dyn})$ (magenta) for the random sample.



Formally, this test is used to check whether the null hypothesis $H_0$, which states
that the two independent samples X and Y are identically distributed (i.e.
F(X) = F(Y)), can be accepted or has to be rejected. More specifically, the result of
such a test, the so-called p-value, is a probability answering the following question:
If the two samples really have the same distribution, what is the probability of
observing, by chance alone, a difference at least as large as the one found? In other
words, were the deviations (differences between the two samples) the result of chance,
or were they due to other factors, and how much deviation can occur before one must
conclude that something other than chance causes the differences? The p-value is called
statistically significant if it is unlikely that the differences occurred by chance
alone, according to a previously chosen threshold probability, the significance level
$\alpha$ (common choices are e.g. $\alpha \in \{0.10, 0.05, 0.01\}$). If
$p > \alpha$, the deviation is small enough that chance alone can account for it; this
is within the range of acceptable deviation. If $p < \alpha$, we must conclude that
some factor other than chance causes the deviation to be so great; this will lead us to
decide that the two sets come from different distributions.
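
The test statistic and a two-sided p-value can be sketched in a few lines. The
following is a simplified illustration, not the procedure used to produce the results
of this article: it assumes the normal approximation of the U statistic without tie or
continuity correction (reasonable only for larger samples), and the function name
`mann_whitney_u` is hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value) via the normal approximation."""
    # Pool both samples, remembering the origin of each value (0 = x, 1 = y).
    pooled = sorted((value, group) for group, sample in enumerate((x, y))
                    for value in sample)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        # Find the run of tied values pooled[i:j]; they share the average rank.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0
        rank_sum_x += average_rank * sum(1 for t in range(i, j) if pooled[t][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u, p_value
```

With a chosen significance level $\alpha$, the null hypothesis of a common distribution
would be rejected if the returned p-value is smaller than $\alpha$.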

For our analysis, we again decided to generate the same number of random structures
for any size as is given for this size in our biological database, such that random
and native sample contain the same number of structures for any occurring size (and
hence the sample sizes are equal). Moreover, note that the unquantified results
presented in Figures 4 and 5 might suggest that for any structure size, some energy
values of randomly generated structures are scattered too widely around the
corresponding expected value, such that those randomly drawn secondary structures
cannot be considered realistic (neither with respect to thermodynamics nor with
respect to structural composition and expected shape). In an attempt to disprove that
assumption, we decided to perform a series of Wilcoxon tests by considering a number
of different random samples. These samples are created by obeying a specified
energy-based rejection scheme: Do not add a randomly generated structure of a given
size to the sample if its free energy (according to the static or dynamic model or
according to both models) lies outside the corresponding confidence interval(s).
Formally, for any previously chosen value k > 1, a generated structure
$s \in \mathcal{L}_n$ is added to the random sample iff

[$g_{stat}(s) \in I_{g_{stat},n}(k)$ (variant "static")] or
[$g_{dyn}(s) \in I_{g_{dyn},n}(k)$ (variant "dynamic")] or
[$g_{stat}(s) \in I_{g_{stat},n}(k) \wedge g_{dyn}(s) \in I_{g_{dyn},n}(k)$ (variant "both")];

otherwise it is rejected. This means we accept only a specified deviation of the
energy $energy(s)$ of the random structure s from the corresponding expected free
energy $\mu_{energy,n}$ and reject structures whose energy differs too much from the
expected value. Note that for $k = \infty$ (confidence interval $I_{energy,n}(k)$
contains 100 percent of the energies $energy(s)$ of all $s \in \mathcal{L}_n$), no
structures are rejected. Hence, in this case, the corresponding random sample
corresponds to the usual (unrestricted) output of our algorithm.
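
The rejection scheme can be written down directly. The following minimal sketch uses
hypothetical names (`in_interval`, `accept`) and assumes that the expected energies and
variances for the considered size n are supplied by the caller; it does not compute
them.

```python
import math


def in_interval(energy, mu, variance, k):
    """True iff energy lies in [mu - k*sigma, mu + k*sigma]; k = inf accepts all."""
    if math.isinf(k):
        return True
    sigma = math.sqrt(variance)
    return mu - k * sigma <= energy <= mu + k * sigma


def accept(g_stat, g_dyn, stat_params, dyn_params, k, variant):
    """Decide whether a generated structure is added to the random sample.

    stat_params and dyn_params are (mu, variance) pairs for the structure's size
    under the static and the dynamic model; variant is "static", "dynamic" or "both".
    """
    ok_stat = in_interval(g_stat, stat_params[0], stat_params[1], k)
    ok_dyn = in_interval(g_dyn, dyn_params[0], dyn_params[1], k)
    if variant == "static":
        return ok_stat
    if variant == "dynamic":
        return ok_dyn
    if variant == "both":
        return ok_stat and ok_dyn
    raise ValueError("unknown variant: " + variant)
```

A restricted sample is then obtained by repeatedly generating structures of the
required size and keeping only those for which `accept` returns `True`.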

The Wilcoxon test results for our native sample together with each of a number of
random sample sets generated in the previously described restricted manner can be
found in Table 4. As we can see, the best results are achieved for the unrestricted
sample sets, where all free energies of randomly generated structures were allowed
during the sample creation process. Moreover, these two results (for the unrestricted
case $k = \infty$) are not statistically significant when considering the common
significance level $\alpha = 0.05$, that is, in both cases we can assume that the
energies of the random structures and those of the biological data follow a common
distribution. These observations indicate that our weighted unranking algorithm
produces random RNA secondary structures that are realistic with respect to the free
energy of such structures (in expectation and variance).

Besides that, it is obvious that the computed p-values are much better for the
dynamic energy model than for the static one. This underlines the suggestion made in
[21] that, although both energy models have been proven to be realistic, due to the
more realistic variation of free energies connected to varying loop length, the dynamic
model should be used for possible applications. Since at least for the dynamic model,
the random data fit very nicely with the native data, we can conclude that structures
generated by our non-uniform random generation algorithm behave realistically with
respect to free energies and, as the energy of the overall structure is assumed to be
equal to the sum of the substructure energies, rather likely also with respect to the
appearance of the different structural motifs of RNA molecules.

Conclusion 

Altogether, we can finally conclude that the non-uniform random generation method 
proposed in this article produces appropriate output and may thus be used (for 
research issues as well as for practical applications) to generate random RNA second- 
ary structures. In fact, for any arbitrary type of (pseudoknot-free) RNA, a correspond- 
ing random sampler can be derived in the presented way. Actually, our webservice can 
be used for generating random secondary structures of any specified type of RNA. It 
just requires a database of known structures for the respective RNA type as input. 

Note that in this work, we abstract from sequence and consider only the structure 
size as input for our algorithm. Thus, an interesting problem for future research would 

Table 4 Significance results for statistical hypothesis testing, computed by the
Wilcoxon rank-sum method

Chosen Value of k      Percent Within   Models Used     Model for         Model for         Resulting Wilcoxon
                       Corr. Interval   for Rejection   Native Energies   Random Energies   p-Value (approx.)
10/(3√11) ≈ 1.00504    1                Dynamic         Dynamic           Dynamic           0.0008438
                                        Static          Static            Static            1.872 · 10^-9
                                        Both            Dynamic           Dynamic           0.000507
                                        Both            Static            Static            1.851 · 10^-10
≈ 1.02598              5                Dynamic         Dynamic           Dynamic           0.001567
                                        Static          Static            Static            1.454 · 10^-10
                                        Both            Dynamic           Dynamic           0.0002654
                                        Both            Static            Static            1.009 · 10^-9
√10/3 ≈ 1.05409        10               Dynamic         Dynamic           Dynamic           0.001374
                                        Static          Static            Static            3.526 · 10^-9
                                        Both            Dynamic           Dynamic           0.0004116
                                        Both            Static            Static            9.018 · 10^-10
2/√3 ≈ 1.15470         25               Dynamic         Dynamic           Dynamic           0.003618
                                        Static          Static            Static            2.530 · 10^-7
                                        Both            Dynamic           Dynamic           0.001228
                                        Both            Static            Static            1.162 · 10^-7
√2 ≈ 1.41421           50               Dynamic         Dynamic           Dynamic           0.02394
                                        Static          Static            Static            1.278 · 10^-6
                                        Both            Dynamic           Dynamic           0.001389
                                        Both            Static            Static            1.515 · 10^-7
2                      75               Dynamic         Dynamic           Dynamic           0.1184
                                        Static          Static            Static            0.001034
                                        Both            Dynamic           Dynamic           0.0495
                                        Both            Static            Static            0.0009445
∞                      100              -               Dynamic           Dynamic           0.4007
                                        -               Static            Static            0.08961



be to find a way to extend the presented realistic SCFG model to additionally deal with 
RNA sequence. In fact, this work and especially the considered elaborate SCFG could 
mark some sort of stepping stone towards new stochastic RNA secondary structure 
prediction methods realized by statistical random sampling. 

Appendix 

How to Construct a Weighted Unranking Algorithm from a Given SCFG 

The purpose of this section is to give a rather small example for applying the
construction scheme described in detail in [20] to proceed from an arbitrary SCFG to
reweighted normal form (RNF) and then to the corresponding weighted combinatorial
classes which allow for non-uniform generation by means of unranking.

Example App.4. Let us consider the SCFG $\mathcal{G}_d$ which contains the following
rules:

\[
\begin{array}{ll}
w_1: S \to B, & \\
w_2: B \to (B), & w_3: B \to |C, \\
w_4: C \to \varepsilon, & w_5: C \to |C.
\end{array}
\]

To apply the approach presented in [20] to transform a given SCFG to RNF, the
grammar needs to be $\varepsilon$-free and loop-free. Thus, we first have to transform
grammar $\mathcal{G}_d$ into the following one:

\[
\begin{array}{ll}
w_1: S \to B, & \\
w_2: B \to (B), & w_3: B \to C, \\
w_4: C \to |, & w_5: C \to |C.
\end{array}
\]

The transformation of $\mathcal{G}_d$ into RNF now works as follows: First, we have to
gather all possible chains $A \to A_1 \to A_2 \to \ldots \to \alpha$, where
$A \neq S$ and $|\alpha| = 1$. These chains are $B \to C$, $B \to C \to |$ and
$C \to |$; the rules $B \to C$ and $C \to |$ are then removed. Second, we have to
replace each of these chains by a specific new rule. In fact, we have to add
$B^{C,\varepsilon} \to C$, $B^{|,C} \to |$ and $C^{|,\varepsilon} \to |$ to the new set
of productions. Consequently, our new rule set is now given by

\[
\begin{array}{lll}
w_1: S \to B, & & \\
w_2: B \to (B), & & \\
w_5: C \to |C, & & \\
1: B^{C,\varepsilon} \to C, & 1: B^{|,C} \to |, & 1: C^{|,\varepsilon} \to |.
\end{array}
\]

Third, for each occurrence of a non-terminal symbol A in the conclusion of a
production and each previously added new rule $A^{\alpha, A_1 A_2 \ldots}$
corresponding to a chain $A \to A_1 \to A_2 \to \ldots \to \alpha$, add a specific new
rule. This way, we obtain the following production set:

\[
\begin{array}{lll}
w_1: S \to B, & w_1 \cdot w_3: S \to B^{C,\varepsilon}, & w_1 \cdot w_3 \cdot w_4: S \to B^{|,C}, \\
w_2: B \to (B), & w_2 \cdot w_3: B \to (B^{C,\varepsilon}), & w_2 \cdot w_3 \cdot w_4: B \to (B^{|,C}), \\
w_5: C \to |C, & w_5 \cdot w_4: C \to |C^{|,\varepsilon}, & \\
1: B^{C,\varepsilon} \to C, & 1: B^{|,C} \to |, & 1: C^{|,\varepsilon} \to |.
\end{array}
\]

Fourth, each intermediate symbol that no longer occurs as premise in any of the
productions has to be removed and fifth, each production of the form $S \to \alpha$,
where S is the axiom and $|\alpha| > 1$, has to be changed in a specific way. However,
since in our case there is obviously nothing left to do, the transformation of
$\mathcal{G}_d$ into RNF is finished.

For $\mathcal{G}_d$ (in RNF), where all production weights are rational, we can
determine the common denominator s of the weights of productions with premise S, as
well as the common denominator c of the weights of the remaining productions (i.e., of
the productions with premise B or C). Then, the reweighting of the production rules of
(the RNF of) $\mathcal{G}_d$ is done by multiplying the weights of productions with
source S by s, and the weights of the other productions $A \to \alpha$, where
$A \neq S$, by the factor $c^{|\alpha|-1}$. After that, we obtain the following
reweighted grammar $\mathcal{G}'_d$:

\[
\begin{array}{lll}
w'_1: S \to B, & w'_2: S \to B^{C,\varepsilon}, & w'_3: S \to B^{|,C}, \\
w'_4: B \to (B), & w'_5: B \to (B^{C,\varepsilon}), & w'_6: B \to (B^{|,C}), \\
w'_7: C \to |C, & w'_8: C \to |C^{|,\varepsilon}, & \\
1: B^{C,\varepsilon} \to C, & 1: B^{|,C} \to |, & 1: C^{|,\varepsilon} \to |,
\end{array}
\]

where each $w'_i$, $1 \leq i \leq 8$, is integral.

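The (now weighted) grammar can easily be translated into a corresponding admissible
specification, which includes the weighting of all involved combinatorial (sub)classes,
as described earlier. For the reweighted grammar $\mathcal{G}'_d$, this specification
is given by the following equations:

\[
\begin{array}{lll}
S_1 = B, & S_2 = B^{C,\varepsilon}, & S_3 = B^{|,C}, \\
B_1 = Z_( \times B \times Z_), & B_2 = Z_( \times B^{C,\varepsilon} \times Z_), & B_3 = Z_( \times B^{|,C} \times Z_), \\
C_1 = Z_| \times C, & C_2 = Z_| \times C^{|,\varepsilon}, & \\
B^{C,\varepsilon} = C, & B^{|,C} = Z_|, & C^{|,\varepsilon} = Z_|, \\
\multicolumn{3}{l}{S = w'_1 \cdot S_1 + w'_2 \cdot S_2 + w'_3 \cdot S_3,} \\
\multicolumn{3}{l}{B = w'_4 \cdot B_1 + w'_5 \cdot B_2 + w'_6 \cdot B_3,} \\
\multicolumn{3}{l}{C = w'_7 \cdot C_1 + w'_8 \cdot C_2,}
\end{array}
\]
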

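which can be simplified in the following way:

\[
\begin{array}{lll}
B_1 = Z_( \times B \times Z_), & B_2 = Z_( \times C \times Z_), & B_3 = Z_( \times Z_| \times Z_), \\
C_1 = Z_| \times C, & C_2 = Z_| \times Z_|, & \\
\multicolumn{3}{l}{S = w'_1 \cdot B + w'_2 \cdot C + w'_3 \cdot Z_|,} \\
\multicolumn{3}{l}{B = w'_4 \cdot B_1 + w'_5 \cdot B_2 + w'_6 \cdot B_3,} \\
\multicolumn{3}{l}{C = w'_7 \cdot C_1 + w'_8 \cdot C_2.}
\end{array}
\]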



As described earlier, this specification (with weighted classes) derived from the
reweighted grammar $\mathcal{G}'_d$ transforms immediately into a recursion for the
function size of all needed combinatorial classes. For $\mathcal{G}'_d$, the recursion
for the function size has the following form:

\[
size(X, n) =
\begin{cases}
size(B, n-2) & X = B_1, \\
size(C, n-2) & X = B_2, \\
1 & X = B_3 \text{ and } n = 3, \\
size(C, n-1) & X = C_1, \\
1 & X = C_2 \text{ and } n = 2, \\
w'_1 \cdot size(B, n) + w'_2 \cdot size(C, n) + w'_3 \cdot 1 & X = S \text{ and } n = 1, \\
w'_1 \cdot size(B, n) + w'_2 \cdot size(C, n) + w'_3 \cdot 0 & X = S \text{ and } n \neq 1, \\
w'_4 \cdot size(B_1, n) + w'_5 \cdot size(B_2, n) + w'_6 \cdot size(B_3, n) & X = B, \\
w'_7 \cdot size(C_1, n) + w'_8 \cdot size(C_2, n) & X = C, \\
0 & \text{else.}
\end{cases}
\]

This recursive size function (with weighted class sizes) can now be used for the
straightforward construction of a corresponding algorithm for the non-uniform
generation of elements of $\mathcal{L}(\mathcal{G}_d)$ by means of unranking, as
proposed in [20].
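
The size recursion and the resulting unranking can be sketched for this toy grammar as
follows. This is an illustration only, not Algorithms 1 to 6 of [20]: the integer
weights in `W` are hypothetical placeholder values for $w'_1, \ldots, w'_8$, sizes for
$n \leq 0$ are assumed to be 0, the helper classes $B_1, B_2, B_3, C_1, C_2$ are
inlined, and within a weighted block the rank is reduced modulo the unweighted block
size, which is one of several possible conventions.

```python
import random
from functools import lru_cache

# Hypothetical integer weights w'_1 .. w'_8 (placeholder values for illustration).
W = {1: 2, 2: 3, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 8: 2}


@lru_cache(maxsize=None)
def size(cls, n):
    """Weighted class sizes following the simplified specification."""
    if n <= 0:
        return 0  # assumption: no objects of non-positive size
    if cls == "C":
        return W[7] * size("C", n - 1) + W[8] * (1 if n == 2 else 0)
    if cls == "B":
        return (W[4] * size("B", n - 2) + W[5] * size("C", n - 2)
                + W[6] * (1 if n == 3 else 0))
    if cls == "S":
        return W[1] * size("B", n) + W[2] * size("C", n) + W[3] * (1 if n == 1 else 0)
    raise ValueError("unknown class: " + cls)


def unrank(cls, n, r):
    """Return the word of class cls and length n that owns rank r."""
    assert 0 <= r < size(cls, n)
    if cls == "C":
        block = W[7] * size("C", n - 1)
        if r < block:  # C_1 = Z_| x C
            return "|" + unrank("C", n - 1, r % size("C", n - 1))
        return "||"  # C_2 = Z_| x Z_| (only reachable for n = 2)
    if cls == "B":
        block = W[4] * size("B", n - 2)
        if r < block:  # B_1 = Z_( x B x Z_)
            return "(" + unrank("B", n - 2, r % size("B", n - 2)) + ")"
        r -= block
        block = W[5] * size("C", n - 2)
        if r < block:  # B_2 = Z_( x C x Z_)
            return "(" + unrank("C", n - 2, r % size("C", n - 2)) + ")"
        return "(|)"  # B_3 = Z_( x Z_| x Z_) (only reachable for n = 3)
    # cls == "S"
    block = W[1] * size("B", n)
    if r < block:
        return unrank("B", n, r % size("B", n))
    r -= block
    block = W[2] * size("C", n)
    if r < block:
        return unrank("C", n, r % size("C", n))
    return "|"  # S -> B^{|,C} (only reachable for n = 1)


def random_word(n):
    """Non-uniform random generation: draw a rank uniformly, then unrank it."""
    return unrank("S", n, random.randrange(size("S", n)))
```

With these placeholder weights, size("S", 4) is 14 and the 14 ranks are shared by the
two words of length 4 in proportion to their weights, so a uniformly drawn rank yields
a non-uniform distribution on words.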



Derivation of the Algorithm 

In this section, we give a complete and detailed description of the derivation of our
weighted unranking algorithm for RNA secondary structures. The different steps are
made according to the approach described in [20] to get an unranking algorithm that
generates random RNA secondary structures of a given size n according to the
distribution on all these structures.

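Considered (unambiguous, $\varepsilon$-free and loop-free) SCFG

First, note that in [21], to obtain the stochastic model for RNA secondary structures
derived from real-world RNA data, the following unambiguous SCFG which unambiguously
generates exactly the language $\mathcal{L}$ given in Definition 0.9 has been used:

Definition App.11. The unambiguous SCFG $\mathcal{G}_{sto}$ generating exactly the
language $\mathcal{L}$ is given by
$\mathcal{G}_{sto} = (I_{\mathcal{G}_{sto}}, \Sigma_{\mathcal{G}_{sto}}, R_{\mathcal{G}_{sto}}, S)$,
where $I_{\mathcal{G}_{sto}} = \{S, T, C, A, L, G, B, F, H, P, Q, R, J, K, M, N, U\}$,
$\Sigma_{\mathcal{G}_{sto}} = \{(, ), |\}$ and $R_{\mathcal{G}_{sto}}$ contains exactly
the following rules:

\[
\begin{array}{llll}
p_1: S \to TAC, & & & \\
p_2: T \to TAC, & p_3: T \to C, & & \\
p_4: C \to C|, & p_5: C \to \varepsilon, & & \\
p_6: A \to (L), & & & \\
p_7: L \to (L), & p_8: L \to M, & p_9: L \to P, & p_{10}: L \to Q, \\
p_{11}: L \to R, & p_{12}: L \to F, & p_{13}: L \to G, & \\
p_{14}: G \to (L)|, & p_{15}: G \to (L)B||, & p_{16}: G \to |(L), & p_{17}: G \to ||B(L), \\
p_{18}: B \to B|, & p_{19}: B \to \varepsilon, & & \\
p_{20}: F \to |||, & p_{21}: F \to ||||, & p_{22}: F \to |||||H, & \\
p_{23}: H \to H|, & p_{24}: H \to \varepsilon, & & \\
p_{25}: P \to |(L)|, & p_{26}: P \to |(L)||, & p_{27}: P \to ||(L)|, & p_{28}: P \to ||(L)||, \\
p_{29}: Q \to ||(L)K|||, & p_{30}: Q \to |||J(L)K||, & & \\
p_{31}: R \to |(L)K|||, & p_{32}: R \to |||J(L)|, & & \\
p_{33}: J \to J|, & p_{34}: J \to \varepsilon, & & \\
p_{35}: K \to K|, & p_{36}: K \to \varepsilon, & & \\
p_{37}: M \to U(L)U(L)N, & & & \\
p_{38}: N \to U(L)N, & p_{39}: N \to U, & & \\
p_{40}: U \to U|, & p_{41}: U \to \varepsilon. & &
\end{array}
\]
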


In this grammar, different intermediate symbols have been used to distinguish
between different substructures. In fact, the reason why this grammar has so many
production rules is that the grammar must be able to distinguish between all the
different classes of substructures for which there are different free energy rules
according to Turner's thermodynamic model considered in [21].

However, as $\varepsilon$-freeness and loop-freeness are required preliminarily, we
have to consider another unambiguous SCFG generating the same language $\mathcal{L}$,
where we have to guarantee that the same substructures are distinguished as are
distinguished in $\mathcal{G}_{sto}$.

Using the usual way of transforming a non-$\varepsilon$-free grammar into an
$\varepsilon$-free one, the following definition can immediately be obtained from the
previous one:

Definition App.12. The unambiguous and $\varepsilon$-free SCFG $\mathcal{G}'_{sto}$
generating exactly the language $\mathcal{L}$ is given by
$\mathcal{G}'_{sto} = (I_{\mathcal{G}'_{sto}}, \Sigma_{\mathcal{G}'_{sto}}, R_{\mathcal{G}'_{sto}}, S')$,
where $I_{\mathcal{G}'_{sto}} = \{S', S, T, C, A, L, G, B, F, H, P, Q, R, J, K, M, N, U\}$,
$\Sigma_{\mathcal{G}'_{sto}} = \{(, ), |\}$ and $R_{\mathcal{G}'_{sto}}$ contains
exactly the following rules:

\[
\begin{array}{llll}
p'_0: S' \to S, & & & \\
p'_1: S \to A, & p'_2: S \to AC, & p'_3: S \to TA, & p'_4: S \to TAC, \\
p'_5: T \to A, & p'_6: T \to AC, & p'_7: T \to TA, & p'_8: T \to TAC, \\
p'_9: T \to C, & & & \\
p'_{10}: C \to |, & p'_{11}: C \to C|, & & \\
p'_{12}: A \to (L), & & & \\
p'_{13}: L \to (L), & p'_{14}: L \to M, & p'_{15}: L \to P, & p'_{16}: L \to Q, \\
p'_{17}: L \to R, & p'_{18}: L \to F, & p'_{19}: L \to G, & \\
p'_{20}: G \to (L)|, & p'_{21}: G \to (L)||, & p'_{22}: G \to (L)B||, & \\
p'_{23}: G \to |(L), & p'_{24}: G \to ||(L), & p'_{25}: G \to ||B(L), & \\
p'_{26}: B \to |, & p'_{27}: B \to B|, & & \\
p'_{28}: F \to |||, & p'_{29}: F \to ||||, & p'_{30}: F \to |||||, & p'_{31}: F \to |||||H, \\
p'_{32}: H \to |, & p'_{33}: H \to H|, & & \\
p'_{34}: P \to |(L)|, & p'_{35}: P \to |(L)||, & p'_{36}: P \to ||(L)|, & p'_{37}: P \to ||(L)||, \\
p'_{38}: Q \to ||(L)|||, & p'_{39}: Q \to ||(L)K|||, & p'_{40}: Q \to |||(L)||, & p'_{41}: Q \to |||J(L)||, \\
p'_{42}: Q \to |||(L)K||, & p'_{43}: Q \to |||J(L)K||, & & \\
p'_{44}: R \to |(L)|||, & p'_{45}: R \to |(L)K|||, & p'_{46}: R \to |||(L)|, & p'_{47}: R \to |||J(L)|, \\
p'_{48}: J \to |, & p'_{49}: J \to J|, & & \\
p'_{50}: K \to |, & p'_{51}: K \to K|, & & \\
p'_{52}: M \to (L)(L), & p'_{53}: M \to U(L)(L), & p'_{54}: M \to (L)U(L), & p'_{55}: M \to (L)(L)N, \\
p'_{56}: M \to U(L)U(L), & p'_{57}: M \to U(L)(L)N, & p'_{58}: M \to (L)U(L)N, & p'_{59}: M \to U(L)U(L)N, \\
p'_{60}: N \to (L), & p'_{61}: N \to U(L), & p'_{62}: N \to (L)N, & p'_{63}: N \to U(L)N, \\
p'_{64}: N \to U, & & & \\
p'_{65}: U \to |, & p'_{66}: U \to U|. & &
\end{array}
\]

Unfortunately, the set of productions of $\mathcal{G}'_{sto}$ contains productions
with up to 5 non-terminal symbols in the conclusion. This is not acceptable for our
purpose, for the following reason: the desired unranking algorithm makes use of the
size of combinatorial classes whose representations are derived from CFGs with
particular integer weights on their productions. If we constructed this WCFG by
starting with the grammar $\mathcal{G}'_{sto}$, then this would yield a huge number of
production rules. Consequently, the translation would imply a huge specification of the
combinatorial classes and the corresponding function to compute their sizes, and thus
the corresponding unranking algorithm would have to distinguish between an
unnecessarily and, most importantly, unacceptably large number of cases.

Nevertheless, the size of the production set of the weighted grammar underlying the
desired unranking algorithm can be significantly reduced by starting with a
modification of grammar $\mathcal{G}'_{sto}$ which has only production rules with the
minimum possible number of non-terminal symbols in the conclusion. In fact, by
transforming $\mathcal{G}'_{sto}$ appropriately considering this observation, we
obtained the SCFG $\hat{\mathcal{G}}_{sto}$:

Definition App.13. The unambiguous $\varepsilon$-free SCFG $\hat{\mathcal{G}}_{sto}$
generating exactly the language $\mathcal{L}$ is given by
$\hat{\mathcal{G}}_{sto} = (I_{\hat{\mathcal{G}}_{sto}}, \Sigma_{\hat{\mathcal{G}}_{sto}}, R_{\hat{\mathcal{G}}_{sto}}, S')$,
where
$I_{\hat{\mathcal{G}}_{sto}} = \{S', E, S, T, C, A, L, G, D, B, F, H, P, Q, R, V, W, O, J, K, M, X, Y, Z, N, U\}$,
$\Sigma_{\hat{\mathcal{G}}_{sto}} = \{(, ), |\}$ and $R_{\hat{\mathcal{G}}_{sto}}$
contains exactly the following rules:

\[
\begin{array}{llll}
\hat{p}_1: S' \to E, & & & \\
\hat{p}_2: E \to S, & \hat{p}_3: E \to SC, & & \\
\hat{p}_4: S \to A, & \hat{p}_5: S \to TA, & & \\
\hat{p}_6: T \to E, & \hat{p}_7: T \to C, & & \\
\hat{p}_8: C \to |, & \hat{p}_9: C \to C|, & & \\
\hat{p}_{10}: A \to (L), & & & \\
\hat{p}_{11}: L \to A, & \hat{p}_{12}: L \to M, & \hat{p}_{13}: L \to P, & \hat{p}_{14}: L \to Q, \\
\hat{p}_{15}: L \to R, & \hat{p}_{16}: L \to F, & \hat{p}_{17}: L \to G, & \\
\hat{p}_{18}: G \to A|, & \hat{p}_{19}: G \to AD, & \hat{p}_{20}: G \to |A, & \hat{p}_{21}: G \to DA, \\
\hat{p}_{22}: D \to B|, & & & \\
\hat{p}_{23}: B \to |, & \hat{p}_{24}: B \to B|, & & \\
\hat{p}_{25}: F \to |||, & \hat{p}_{26}: F \to ||||, & \hat{p}_{27}: F \to ||||H, & \\
\hat{p}_{28}: H \to |, & \hat{p}_{29}: H \to H|, & & \\
\hat{p}_{30}: P \to |A|, & \hat{p}_{31}: P \to |A||, & \hat{p}_{32}: P \to ||A|, & \hat{p}_{33}: P \to ||A||, \\
\hat{p}_{34}: Q \to ||O||, & \hat{p}_{35}: Q \to ||V|, & & \\
\hat{p}_{36}: R \to |O||, & \hat{p}_{37}: R \to ||W|, & & \\
\hat{p}_{38}: V \to JO, & & & \\
\hat{p}_{39}: W \to JA, & & & \\
\hat{p}_{40}: O \to AK, & & & \\
\hat{p}_{41}: J \to |, & \hat{p}_{42}: J \to J|, & & \\
\hat{p}_{43}: K \to |, & \hat{p}_{44}: K \to K|, & & \\
\hat{p}_{45}: M \to XY, & & & \\
\hat{p}_{46}: X \to A, & \hat{p}_{47}: X \to UA, & & \\
\hat{p}_{48}: Y \to Z, & & & \\
\hat{p}_{49}: Z \to X, & \hat{p}_{50}: Z \to XN, & & \\
\hat{p}_{51}: N \to Z, & \hat{p}_{52}: N \to U, & & \\
\hat{p}_{53}: U \to |, & \hat{p}_{54}: U \to U|. & &
\end{array}
\]


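Transforming our SCFG into RNF

Now, we can construct the desired weighted grammar that will be underlying our
unranking algorithm: In the first step, we gather all possible chains of productions
that do not lengthen the sentential form. In fact, we have to consider all rules
$A \to \alpha$, $A \neq S'$, with $|\alpha| = 1$, to obtain all such chains (note that
these rules will be removed after step 1). Hence, we have to consider the following set
of 22 production rules:

\[
\begin{array}{llll}
\hat{p}_2: E \to S, & & & \\
\hat{p}_4: S \to A, & & & \\
\hat{p}_6: T \to E, & \hat{p}_7: T \to C, & & \\
\hat{p}_8: C \to |, & & & \\
\hat{p}_{11}: L \to A, & \hat{p}_{12}: L \to M, & \hat{p}_{13}: L \to P, & \hat{p}_{14}: L \to Q, \\
\hat{p}_{15}: L \to R, & \hat{p}_{16}: L \to F, & \hat{p}_{17}: L \to G, & \\
\hat{p}_{23}: B \to |, & & & \\
\hat{p}_{28}: H \to |, & & & \\
\hat{p}_{41}: J \to |, & & & \\
\hat{p}_{43}: K \to |, & & & \\
\hat{p}_{46}: X \to A, & & & \\
\hat{p}_{48}: Y \to Z, & & & \\
\hat{p}_{49}: Z \to X, & & & \\
\hat{p}_{51}: N \to Z, & \hat{p}_{52}: N \to U, & & \\
\hat{p}_{53}: U \to |. & & &
\end{array}
\]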

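Thus, the following 32 chains are gathered in step 1:

$E \Rightarrow S$,                                      $\mathrm{targets}[E] = \{(S, \lambda_{E,S} := p_{2}, \varepsilon),$
$E \Rightarrow S \Rightarrow A$,                        $(A, \lambda_{E,A} := p_{2} \cdot p_{4}, S)\}$,
$S \Rightarrow A$,                                      $\mathrm{targets}[S] = \{(A, \lambda_{S,A} := p_{4}, \varepsilon)\}$,
$T \Rightarrow E$,                                      $\mathrm{targets}[T] = \{(E, \lambda_{T,E} := p_{6}, \varepsilon),$
$T \Rightarrow C$,                                      $(C, \lambda_{T,C} := p_{7}, \varepsilon),$
$T \Rightarrow C \Rightarrow |$,                        $(|, \lambda_{T,|} := p_{7} \cdot p_{8}, C),$
$T \Rightarrow E \Rightarrow S$,                        $(S, \lambda_{T,S} := p_{6} \cdot p_{2}, E),$
$T \Rightarrow E \Rightarrow S \Rightarrow A$,          $(A, \lambda_{T,A} := p_{6} \cdot p_{2} \cdot p_{4}, ES)\}$,
$C \Rightarrow |$,                                      $\mathrm{targets}[C] = \{(|, \lambda_{C,|} := p_{8}, \varepsilon)\}$,
$L \Rightarrow A$,                                      $\mathrm{targets}[L] = \{(A, \lambda_{L,A} := p_{11}, \varepsilon),$
$L \Rightarrow M$,                                      $(M, \lambda_{L,M} := p_{12}, \varepsilon),$
$L \Rightarrow P$,                                      $(P, \lambda_{L,P} := p_{13}, \varepsilon),$
$L \Rightarrow Q$,                                      $(Q, \lambda_{L,Q} := p_{14}, \varepsilon),$
$L \Rightarrow R$,                                      $(R, \lambda_{L,R} := p_{15}, \varepsilon),$
$L \Rightarrow F$,                                      $(F, \lambda_{L,F} := p_{16}, \varepsilon),$
$L \Rightarrow G$,                                      $(G, \lambda_{L,G} := p_{17}, \varepsilon)\}$,
$B \Rightarrow |$,                                      $\mathrm{targets}[B] = \{(|, \lambda_{B,|} := p_{23}, \varepsilon)\}$,
$H \Rightarrow |$,                                      $\mathrm{targets}[H] = \{(|, \lambda_{H,|} := p_{28}, \varepsilon)\}$,
$J \Rightarrow |$,                                      $\mathrm{targets}[J] = \{(|, \lambda_{J,|} := p_{41}, \varepsilon)\}$,
$K \Rightarrow |$,                                      $\mathrm{targets}[K] = \{(|, \lambda_{K,|} := p_{43}, \varepsilon)\}$,
$X \Rightarrow A$,                                      $\mathrm{targets}[X] = \{(A, \lambda_{X,A} := p_{46}, \varepsilon)\}$,
$Y \Rightarrow Z$,                                      $\mathrm{targets}[Y] = \{(Z, \lambda_{Y,Z} := p_{48}, \varepsilon),$
$Y \Rightarrow Z \Rightarrow X$,                        $(X, \lambda_{Y,X} := p_{48} \cdot p_{49}, Z),$
$Y \Rightarrow Z \Rightarrow X \Rightarrow A$,          $(A, \lambda_{Y,A} := p_{48} \cdot p_{49} \cdot p_{46}, ZX)\}$,
$Z \Rightarrow X$,                                      $\mathrm{targets}[Z] = \{(X, \lambda_{Z,X} := p_{49}, \varepsilon),$
$Z \Rightarrow X \Rightarrow A$,                        $(A, \lambda_{Z,A} := p_{49} \cdot p_{46}, X)\}$,
$N \Rightarrow Z$,                                      $\mathrm{targets}[N] = \{(Z, \lambda_{N,Z} := p_{51}, \varepsilon),$
$N \Rightarrow U$,                                      $(U, \lambda_{N,U} := p_{52}, \varepsilon),$
$N \Rightarrow U \Rightarrow |$,                        $(|, \lambda_{N,|} := p_{52} \cdot p_{53}, U),$
$N \Rightarrow Z \Rightarrow X$,                        $(X, \lambda_{N,X} := p_{51} \cdot p_{49}, Z),$
$N \Rightarrow Z \Rightarrow X \Rightarrow A$,          $(A, \lambda_{N,A} := p_{51} \cdot p_{49} \cdot p_{46}, ZX)\}$,
$U \Rightarrow |$,                                      $\mathrm{targets}[U] = \{(|, \lambda_{U,|} := p_{53}, \varepsilon)\}$.

Furthermore, the 22 production rules considered above are now removed from $R_{\mathcal{G}_{sto}}$. This results in the following set of 32 remaining rules:

$p_{1}: S' \to E$,
$p_{3}: E \to SC$,
$p_{5}: S \to TA$,
$p_{9}: C \to C|$,
$p_{10}: A \to (L)$,
$p_{18}: G \to A|$, $p_{19}: G \to AD$, $p_{20}: G \to |A$, $p_{21}: G \to DA$,
$p_{22}: D \to B|$,
$p_{24}: B \to B|$,
$p_{25}: F \to |||$, $p_{26}: F \to ||||$, $p_{27}: F \to ||||H$,
$p_{29}: H \to H|$,
$p_{30}: P \to |A|$, $p_{31}: P \to |A||$, $p_{32}: P \to ||A|$, $p_{33}: P \to ||A||$,
$p_{34}: Q \to ||O||$, $p_{35}: Q \to ||V|$,
$p_{36}: R \to |O||$, $p_{37}: R \to ||W|$,
$p_{38}: V \to JO$,
$p_{39}: W \to JA$,
$p_{40}: O \to AK$,
$p_{42}: J \to J|$,
$p_{44}: K \to K|$,
$p_{45}: M \to XY$,
$p_{47}: X \to UA$,
$p_{50}: Z \to XN$,
$p_{54}: U \to U|$.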
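The chain gathering of step 1 can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the authors' implementation: the names `UNIT_RULES` and `gather_chains` are hypothetical, rules are identified by their numbers, and each chain is stored as a triple (target symbol, rule numbers whose probabilities are multiplied, intermediate symbols passed), mirroring the entries of targets[·] listed above.

```python
# Unit productions (|alpha| = 1, premise != S'), keyed by rule number.
UNIT_RULES = {
    2: ("E", "S"), 4: ("S", "A"), 6: ("T", "E"), 7: ("T", "C"), 8: ("C", "|"),
    11: ("L", "A"), 12: ("L", "M"), 13: ("L", "P"), 14: ("L", "Q"),
    15: ("L", "R"), 16: ("L", "F"), 17: ("L", "G"),
    23: ("B", "|"), 28: ("H", "|"), 41: ("J", "|"), 43: ("K", "|"),
    46: ("X", "A"), 48: ("Y", "Z"), 49: ("Z", "X"),
    51: ("N", "Z"), 52: ("N", "U"), 53: ("U", "|"),
}


def gather_chains(unit_rules):
    """Return {premise: [(target, rule_numbers, symbols_passed), ...]}."""
    by_premise = {}
    for number, (lhs, rhs) in unit_rules.items():
        by_premise.setdefault(lhs, []).append((number, rhs))

    def extend(symbol, rules_so_far, passed):
        # Depth-first walk; the unit rules used here are acyclic, so it terminates.
        for number, rhs in by_premise.get(symbol, []):
            rules = rules_so_far + (number,)
            yield rhs, rules, passed
            yield from extend(rhs, rules, passed + rhs)

    return {lhs: list(extend(lhs, (), "")) for lhs in by_premise}


targets = gather_chains(UNIT_RULES)
chain_count = sum(len(chains) for chains in targets.values())
```

For instance, `targets["T"]` contains the triple `("A", (6, 2, 4), "ES")`, which corresponds to the chain $T \Rightarrow E \Rightarrow S \Rightarrow A$ with weight $p_{6} \cdot p_{2} \cdot p_{4}$.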

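Additionally, in step 2, for each chain a new intermediate symbol and a new production are introduced. Thus, according to the 32 chains gathered in step 1, we here obtain the following set of 32 new production rules (each with weight 1):

$1: E^{S,\varepsilon} \to S$, $1: E^{A,S} \to A$,
$1: S^{A,\varepsilon} \to A$,
$1: T^{E,\varepsilon} \to E$, $1: T^{C,\varepsilon} \to C$, $1: T^{|,C} \to |$,
$1: T^{S,E} \to S$, $1: T^{A,ES} \to A$,
$1: C^{|,\varepsilon} \to |$,
$1: L^{A,\varepsilon} \to A$, $1: L^{M,\varepsilon} \to M$, $1: L^{P,\varepsilon} \to P$, $1: L^{Q,\varepsilon} \to Q$,
$1: L^{R,\varepsilon} \to R$, $1: L^{F,\varepsilon} \to F$, $1: L^{G,\varepsilon} \to G$,
$1: B^{|,\varepsilon} \to |$,
$1: H^{|,\varepsilon} \to |$,
$1: J^{|,\varepsilon} \to |$,
$1: K^{|,\varepsilon} \to |$,
$1: X^{A,\varepsilon} \to A$,
$1: Y^{Z,\varepsilon} \to Z$, $1: Y^{X,Z} \to X$, $1: Y^{A,ZX} \to A$,
$1: Z^{X,\varepsilon} \to X$, $1: Z^{A,X} \to A$,
$1: N^{Z,\varepsilon} \to Z$, $1: N^{U,\varepsilon} \to U$, $1: N^{|,U} \to |$,
$1: N^{X,Z} \to X$, $1: N^{A,ZX} \to A$,
$1: U^{|,\varepsilon} \to |$.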

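In step 3, for each occurrence of a non-terminal symbol in the conclusion of a production and each chain starting with this non-terminal symbol, we have to add a new production with the corresponding new intermediate symbol instead of the considered one. Thus, in step 3, the remaining 32 production rules are transformed (according to the new rules of step 2) into the following set of 79 new rules:

$p_{1}: S' \to E$, $p_{1} \cdot \lambda_{E,S}: S' \to E^{S,\varepsilon}$, $p_{1} \cdot \lambda_{E,A}: S' \to E^{A,S}$,
$p_{3}: E \to SC$, $p_{3} \cdot \lambda_{S,A}: E \to S^{A,\varepsilon}C$,
$p_{3} \cdot \lambda_{C,|}: E \to SC^{|,\varepsilon}$, $p_{3} \cdot \lambda_{S,A} \cdot \lambda_{C,|}: E \to S^{A,\varepsilon}C^{|,\varepsilon}$,
$p_{5}: S \to TA$, $p_{5} \cdot \lambda_{T,E}: S \to T^{E,\varepsilon}A$,
$p_{5} \cdot \lambda_{T,C}: S \to T^{C,\varepsilon}A$, $p_{5} \cdot \lambda_{T,|}: S \to T^{|,C}A$,
$p_{5} \cdot \lambda_{T,S}: S \to T^{S,E}A$, $p_{5} \cdot \lambda_{T,A}: S \to T^{A,ES}A$,
$p_{9}: C \to C|$, $p_{9} \cdot \lambda_{C,|}: C \to C^{|,\varepsilon}|$,
$p_{10}: A \to (L)$, $p_{10} \cdot \lambda_{L,A}: A \to (L^{A,\varepsilon})$,
$p_{10} \cdot \lambda_{L,M}: A \to (L^{M,\varepsilon})$, $p_{10} \cdot \lambda_{L,P}: A \to (L^{P,\varepsilon})$,
$p_{10} \cdot \lambda_{L,Q}: A \to (L^{Q,\varepsilon})$, $p_{10} \cdot \lambda_{L,R}: A \to (L^{R,\varepsilon})$,
$p_{10} \cdot \lambda_{L,F}: A \to (L^{F,\varepsilon})$, $p_{10} \cdot \lambda_{L,G}: A \to (L^{G,\varepsilon})$,
$p_{18}: G \to A|$, $p_{19}: G \to AD$,
$p_{20}: G \to |A$, $p_{21}: G \to DA$,
$p_{22}: D \to B|$, $p_{22} \cdot \lambda_{B,|}: D \to B^{|,\varepsilon}|$,
$p_{24}: B \to B|$, $p_{24} \cdot \lambda_{B,|}: B \to B^{|,\varepsilon}|$,
$p_{25}: F \to |||$, $p_{26}: F \to ||||$,
$p_{27}: F \to ||||H$, $p_{27} \cdot \lambda_{H,|}: F \to ||||H^{|,\varepsilon}$,
$p_{29}: H \to H|$, $p_{29} \cdot \lambda_{H,|}: H \to H^{|,\varepsilon}|$,
$p_{30}: P \to |A|$, $p_{31}: P \to |A||$,
$p_{32}: P \to ||A|$, $p_{33}: P \to ||A||$,
$p_{34}: Q \to ||O||$, $p_{35}: Q \to ||V|$,
$p_{36}: R \to |O||$, $p_{37}: R \to ||W|$,
$p_{38}: V \to JO$, $p_{38} \cdot \lambda_{J,|}: V \to J^{|,\varepsilon}O$,
$p_{39}: W \to JA$, $p_{39} \cdot \lambda_{J,|}: W \to J^{|,\varepsilon}A$,
$p_{40}: O \to AK$, $p_{40} \cdot \lambda_{K,|}: O \to AK^{|,\varepsilon}$,
$p_{42}: J \to J|$, $p_{42} \cdot \lambda_{J,|}: J \to J^{|,\varepsilon}|$,
$p_{44}: K \to K|$, $p_{44} \cdot \lambda_{K,|}: K \to K^{|,\varepsilon}|$,
$p_{45}: M \to XY$, $p_{45} \cdot \lambda_{Y,Z}: M \to XY^{Z,\varepsilon}$,
$p_{45} \cdot \lambda_{Y,X}: M \to XY^{X,Z}$, $p_{45} \cdot \lambda_{Y,A}: M \to XY^{A,ZX}$,
$p_{45} \cdot \lambda_{X,A}: M \to X^{A,\varepsilon}Y$, $p_{45} \cdot \lambda_{X,A} \cdot \lambda_{Y,Z}: M \to X^{A,\varepsilon}Y^{Z,\varepsilon}$,
$p_{45} \cdot \lambda_{X,A} \cdot \lambda_{Y,X}: M \to X^{A,\varepsilon}Y^{X,Z}$, $p_{45} \cdot \lambda_{X,A} \cdot \lambda_{Y,A}: M \to X^{A,\varepsilon}Y^{A,ZX}$,
$p_{47}: X \to UA$, $p_{47} \cdot \lambda_{U,|}: X \to U^{|,\varepsilon}A$,
$p_{50}: Z \to XN$, $p_{50} \cdot \lambda_{N,Z}: Z \to XN^{Z,\varepsilon}$,
$p_{50} \cdot \lambda_{N,U}: Z \to XN^{U,\varepsilon}$, $p_{50} \cdot \lambda_{N,|}: Z \to XN^{|,U}$,
$p_{50} \cdot \lambda_{N,X}: Z \to XN^{X,Z}$, $p_{50} \cdot \lambda_{N,A}: Z \to XN^{A,ZX}$,
$p_{50} \cdot \lambda_{X,A}: Z \to X^{A,\varepsilon}N$, $p_{50} \cdot \lambda_{X,A} \cdot \lambda_{N,Z}: Z \to X^{A,\varepsilon}N^{Z,\varepsilon}$,
$p_{50} \cdot \lambda_{X,A} \cdot \lambda_{N,U}: Z \to X^{A,\varepsilon}N^{U,\varepsilon}$, $p_{50} \cdot \lambda_{X,A} \cdot \lambda_{N,|}: Z \to X^{A,\varepsilon}N^{|,U}$,
$p_{50} \cdot \lambda_{X,A} \cdot \lambda_{N,X}: Z \to X^{A,\varepsilon}N^{X,Z}$, $p_{50} \cdot \lambda_{X,A} \cdot \lambda_{N,A}: Z \to X^{A,\varepsilon}N^{A,ZX}$,
$p_{54}: U \to U|$, $p_{54} \cdot \lambda_{U,|}: U \to U^{|,\varepsilon}|$.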

In step 4, we must delete all intermediate symbols that no longer occur as premise. Obviously, the intermediate symbols no longer occurring as premise of a production are

$T, L, N, Y$.

We easily observe that the productions that contain at least one of these 4 intermediate symbols in the conclusion and thus have to be removed are exactly the following ones:

$p_{5}: S \to TA$,
$p_{10}: A \to (L)$,
$p_{45}: M \to XY$, $p_{45} \cdot \lambda_{X,A}: M \to X^{A,\varepsilon}Y$,
$p_{50}: Z \to XN$, $p_{50} \cdot \lambda_{X,A}: Z \to X^{A,\varepsilon}N$.

Consequently, after the removal of these 6 rules from the 79 rules obtained in step 3, there still remain 73 new production rules.
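The counts of steps 3 and 4 (79 rules, of which 6 are removed) can be reproduced mechanically. The following Python sketch is our own illustration under stated assumptions, not code from the paper: `CHAINS`, `REMAINING` and `variants` are hypothetical names, conclusions are written as plain strings (every symbol is a single character), and an intermediate symbol such as $T^{A,ES}$ is rendered as the string `T^(A,ES)`.

```python
from itertools import product

# Chains gathered in step 1, as (target, symbols passed) per starting non-terminal.
CHAINS = {
    "E": [("S", ""), ("A", "S")],
    "S": [("A", "")],
    "T": [("E", ""), ("C", ""), ("|", "C"), ("S", "E"), ("A", "ES")],
    "C": [("|", "")],
    "L": [(t, "") for t in "AMPQRFG"],
    "B": [("|", "")], "H": [("|", "")], "J": [("|", "")], "K": [("|", "")],
    "X": [("A", "")],
    "Y": [("Z", ""), ("X", "Z"), ("A", "ZX")],
    "Z": [("X", ""), ("A", "X")],
    "N": [("Z", ""), ("U", ""), ("|", "U"), ("X", "Z"), ("A", "ZX")],
    "U": [("|", "")],
}

# The 32 rules left after removing the unit productions: number -> (premise, conclusion).
REMAINING = {
    1: ("S'", "E"), 3: ("E", "SC"), 5: ("S", "TA"), 9: ("C", "C|"), 10: ("A", "(L)"),
    18: ("G", "A|"), 19: ("G", "AD"), 20: ("G", "|A"), 21: ("G", "DA"), 22: ("D", "B|"),
    24: ("B", "B|"), 25: ("F", "|||"), 26: ("F", "||||"), 27: ("F", "||||H"),
    29: ("H", "H|"), 30: ("P", "|A|"), 31: ("P", "|A||"), 32: ("P", "||A|"),
    33: ("P", "||A||"), 34: ("Q", "||O||"), 35: ("Q", "||V|"), 36: ("R", "|O||"),
    37: ("R", "||W|"), 38: ("V", "JO"), 39: ("W", "JA"), 40: ("O", "AK"),
    42: ("J", "J|"), 44: ("K", "K|"), 45: ("M", "XY"), 47: ("X", "UA"),
    50: ("Z", "XN"), 54: ("U", "U|"),
}


def variants(conclusion):
    """Step 3: each symbol either stays or becomes one of its intermediate symbols."""
    options = [
        [sym] + [f"{sym}^({target},{passed or 'eps'})"
                 for target, passed in CHAINS.get(sym, [])]
        for sym in conclusion
    ]
    return [tuple(choice) for choice in product(*options)]


step3 = [(lhs, rhs) for lhs, conclusion in REMAINING.values()
         for rhs in variants(conclusion)]

# Step 4: plain non-terminals that occur in a conclusion but never as a premise.
premises = {lhs for lhs, _ in step3}
dead = {sym for _, rhs in step3 for sym in rhs
        if sym.isalpha() and sym not in premises}
step4 = [(lhs, rhs) for lhs, rhs in step3 if not dead.intersection(rhs)]
```

Under these assumptions, `len(step3)` is 79, `dead` is the set of symbols T, L, N, Y, and `len(step4)` is 73, in agreement with the counts stated in the text.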

Finally, in step 5, we must make sure that the conclusion of all productions with premise $S'$ (the axiom of $\mathcal{G}_{sto}$ that we started with) does not have a length greater than 1. However, since there is only one production with premise $S'$ in our start grammar $\mathcal{G}_{sto}$ and the conclusion of this production has size 1, there is nothing to do. Thus, the resulting new grammar is given by:

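Definition App~.14. The WCFG $\mathcal{G}^{*}_{sto}$ generating exactly the language $\mathcal{L}$ is given by

$\mathcal{G}^{*}_{sto} = (I_{\mathcal{G}^{*}_{sto}} \cup I'_{\mathcal{G}^{*}_{sto}}, \Sigma_{\mathcal{G}^{*}_{sto}}, R_{\mathcal{G}^{*}_{sto}} \cup R'_{\mathcal{G}^{*}_{sto}}, S')$, where

$I_{\mathcal{G}^{*}_{sto}} = \{S', E, S, C, A, G, D, B, F, H, P, Q, R, V, W, O, J, K, M, X, Z, U\}$,

$I'_{\mathcal{G}^{*}_{sto}} = \{E^{S,\varepsilon}, E^{A,S}, S^{A,\varepsilon},$
$T^{E,\varepsilon}, T^{C,\varepsilon}, T^{|,C}, T^{S,E}, T^{A,ES}, C^{|,\varepsilon},$
$L^{A,\varepsilon}, L^{M,\varepsilon}, L^{P,\varepsilon}, L^{Q,\varepsilon}, L^{R,\varepsilon}, L^{F,\varepsilon}, L^{G,\varepsilon},$
$B^{|,\varepsilon}, H^{|,\varepsilon}, J^{|,\varepsilon}, K^{|,\varepsilon},$
$X^{A,\varepsilon}, Y^{Z,\varepsilon}, Y^{X,Z}, Y^{A,ZX}, Z^{X,\varepsilon}, Z^{A,X},$
$N^{Z,\varepsilon}, N^{U,\varepsilon}, N^{|,U}, N^{X,Z}, N^{A,ZX}, U^{|,\varepsilon}\}$,

$\Sigma_{\mathcal{G}^{*}_{sto}} = \{(, ), |\}$ and $R_{\mathcal{G}^{*}_{sto}}$ contains exactly the following rules:

$\lambda_{1}: S' \to E$, $\lambda_{2}: S' \to E^{S,\varepsilon}$, $\lambda_{3}: S' \to E^{A,S}$,
$\lambda_{4}: E \to SC$, $\lambda_{5}: E \to S^{A,\varepsilon}C$, $\lambda_{6}: E \to SC^{|,\varepsilon}$, $\lambda_{7}: E \to S^{A,\varepsilon}C^{|,\varepsilon}$,
$\lambda_{8}: S \to T^{E,\varepsilon}A$, $\lambda_{9}: S \to T^{C,\varepsilon}A$, $\lambda_{10}: S \to T^{|,C}A$,
$\lambda_{11}: S \to T^{S,E}A$, $\lambda_{12}: S \to T^{A,ES}A$,
$\lambda_{13}: C \to C|$, $\lambda_{14}: C \to C^{|,\varepsilon}|$,
$\lambda_{15}: A \to (L^{A,\varepsilon})$, $\lambda_{16}: A \to (L^{M,\varepsilon})$, $\lambda_{17}: A \to (L^{P,\varepsilon})$, $\lambda_{18}: A \to (L^{Q,\varepsilon})$,
$\lambda_{19}: A \to (L^{R,\varepsilon})$, $\lambda_{20}: A \to (L^{F,\varepsilon})$, $\lambda_{21}: A \to (L^{G,\varepsilon})$,
$\lambda_{22}: G \to A|$, $\lambda_{23}: G \to AD$, $\lambda_{24}: G \to |A$, $\lambda_{25}: G \to DA$,
$\lambda_{26}: D \to B|$, $\lambda_{27}: D \to B^{|,\varepsilon}|$,
$\lambda_{28}: B \to B|$, $\lambda_{29}: B \to B^{|,\varepsilon}|$,
$\lambda_{30}: F \to |||$, $\lambda_{31}: F \to ||||$, $\lambda_{32}: F \to ||||H$, $\lambda_{33}: F \to ||||H^{|,\varepsilon}$,
$\lambda_{34}: H \to H|$, $\lambda_{35}: H \to H^{|,\varepsilon}|$,
$\lambda_{36}: P \to |A|$, $\lambda_{37}: P \to |A||$, $\lambda_{38}: P \to ||A|$, $\lambda_{39}: P \to ||A||$,
$\lambda_{40}: Q \to ||O||$, $\lambda_{41}: Q \to ||V|$,
$\lambda_{42}: R \to |O||$, $\lambda_{43}: R \to ||W|$,
$\lambda_{44}: V \to JO$, $\lambda_{45}: V \to J^{|,\varepsilon}O$,
$\lambda_{46}: W \to JA$, $\lambda_{47}: W \to J^{|,\varepsilon}A$,
$\lambda_{48}: O \to AK$, $\lambda_{49}: O \to AK^{|,\varepsilon}$,
$\lambda_{50}: J \to J|$, $\lambda_{51}: J \to J^{|,\varepsilon}|$,
$\lambda_{52}: K \to K|$, $\lambda_{53}: K \to K^{|,\varepsilon}|$,
$\lambda_{54}: M \to XY^{Z,\varepsilon}$, $\lambda_{55}: M \to XY^{X,Z}$, $\lambda_{56}: M \to XY^{A,ZX}$,
$\lambda_{57}: M \to X^{A,\varepsilon}Y^{Z,\varepsilon}$, $\lambda_{58}: M \to X^{A,\varepsilon}Y^{X,Z}$, $\lambda_{59}: M \to X^{A,\varepsilon}Y^{A,ZX}$,
$\lambda_{60}: X \to UA$, $\lambda_{61}: X \to U^{|,\varepsilon}A$,
$\lambda_{62}: Z \to XN^{Z,\varepsilon}$, $\lambda_{63}: Z \to XN^{U,\varepsilon}$, $\lambda_{64}: Z \to XN^{|,U}$,
$\lambda_{65}: Z \to XN^{X,Z}$, $\lambda_{66}: Z \to XN^{A,ZX}$,
$\lambda_{67}: Z \to X^{A,\varepsilon}N^{Z,\varepsilon}$, $\lambda_{68}: Z \to X^{A,\varepsilon}N^{U,\varepsilon}$, $\lambda_{69}: Z \to X^{A,\varepsilon}N^{|,U}$,
$\lambda_{70}: Z \to X^{A,\varepsilon}N^{X,Z}$, $\lambda_{71}: Z \to X^{A,\varepsilon}N^{A,ZX}$,
$\lambda_{72}: U \to U|$, $\lambda_{73}: U \to U^{|,\varepsilon}|$,

whereas $R'_{\mathcal{G}^{*}_{sto}}$ contains exactly the following rules:

$\lambda_{74}: E^{S,\varepsilon} \to S$, $\lambda_{75}: E^{A,S} \to A$,
$\lambda_{76}: S^{A,\varepsilon} \to A$,
$\lambda_{77}: T^{E,\varepsilon} \to E$, $\lambda_{78}: T^{C,\varepsilon} \to C$, $\lambda_{79}: T^{|,C} \to |$,
$\lambda_{80}: T^{S,E} \to S$, $\lambda_{81}: T^{A,ES} \to A$,
$\lambda_{82}: C^{|,\varepsilon} \to |$,
$\lambda_{83}: L^{A,\varepsilon} \to A$, $\lambda_{84}: L^{M,\varepsilon} \to M$, $\lambda_{85}: L^{P,\varepsilon} \to P$, $\lambda_{86}: L^{Q,\varepsilon} \to Q$,
$\lambda_{87}: L^{R,\varepsilon} \to R$, $\lambda_{88}: L^{F,\varepsilon} \to F$, $\lambda_{89}: L^{G,\varepsilon} \to G$,
$\lambda_{90}: B^{|,\varepsilon} \to |$,
$\lambda_{91}: H^{|,\varepsilon} \to |$,
$\lambda_{92}: J^{|,\varepsilon} \to |$,
$\lambda_{93}: K^{|,\varepsilon} \to |$,
$\lambda_{94}: X^{A,\varepsilon} \to A$,
$\lambda_{95}: Y^{Z,\varepsilon} \to Z$, $\lambda_{96}: Y^{X,Z} \to X$, $\lambda_{97}: Y^{A,ZX} \to A$,
$\lambda_{98}: Z^{X,\varepsilon} \to X$, $\lambda_{99}: Z^{A,X} \to A$,
$\lambda_{100}: N^{Z,\varepsilon} \to Z$, $\lambda_{101}: N^{U,\varepsilon} \to U$, $\lambda_{102}: N^{|,U} \to |$,
$\lambda_{103}: N^{X,Z} \to X$, $\lambda_{104}: N^{A,ZX} \to A$,
$\lambda_{105}: U^{|,\varepsilon} \to |$.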



Reweighting the Production Rules 

Now, the weights of the 73 production rules given in the subset of productions $R_{\mathcal{G}^{*}_{sto}}$ have to be reweighted. In order to achieve this goal, we first have to compute the two common denominators $s$ and $c$, where $s$ is the common denominator of the weights of productions with premise $S'$ (i.e., of productions number 1 to 3), and $c$ is the common denominator of the weights of the remaining productions (i.e., of productions number 4 to 73) of $R_{\mathcal{G}^{*}_{sto}}$. Using the rounded probabilities (weights) for the production rules of $\mathcal{G}^{*}_{sto}$ as given in Table 5, we immediately find the smallest common denominators to be

$s = 10{,}000$ and $c = 10{,}000$.

The desired new weights for the considered set of productions $R_{\mathcal{G}^{*}_{sto}}$ are then computed by multiplying the old weights of productions with source $S'$ by $s$, and by multiplying the old weights of productions $A \to \alpha$, $A \neq S'$ (and $A \in I_{\mathcal{G}^{*}_{sto}}$), by $c^{|\alpha|-1}$.

Formally, for the reweighted set of productions $R_{\mathcal{G}^{*}_{sto}}$, we get the following weights:

\[
\mu_i := \lambda_i \cdot s, \quad \text{for } i \in \{1, 2, 3\},
\]

and

\[
\mu_i := \lambda_i \cdot c^{|\alpha_i| - 1}, \quad \text{where } \lambda_i : A_i \to \alpha_i, \text{ for } i \in \{4, \ldots, 73\}.
\]

The resulting integer weights can be found in Table 6.
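As a small illustration of the reweighting formulas, the following Python sketch converts a rounded probability into its integer weight using exact rational arithmetic. It is only a sketch under our own assumptions: the function name `integer_weight` and its parameters are hypothetical, probabilities are passed as decimal strings to avoid floating point error, and an intermediate symbol such as $L^{A,\varepsilon}$ counts as one symbol of the conclusion.

```python
from fractions import Fraction

S_DENOM = 10_000  # common denominator s for the productions with premise S'
C_DENOM = 10_000  # common denominator c for all other productions


def integer_weight(probability, conclusion_length, premise_is_axiom=False):
    """Return mu = lambda * s (premise S') or mu = lambda * c**(|alpha| - 1)."""
    lam = Fraction(probability)  # exact value of a decimal string such as "0.7630"
    factor = S_DENOM if premise_is_axiom else C_DENOM ** (conclusion_length - 1)
    scaled = lam * factor
    if scaled.denominator != 1:
        raise ValueError("weight is not integral for the chosen denominator")
    return int(scaled)


mu_2 = integer_weight("0.0212", 1, premise_is_axiom=True)  # S' -> E^{S,eps}
mu_4 = integer_weight("0.9788", 2)                         # E -> SC
mu_15 = integer_weight("0.7630", 3)                        # A -> (L^{A,eps})
mu_32 = integer_weight("0.6016", 5)                        # F -> ||||H
```

For example, rule $\lambda_{15}$ has a conclusion of length 3, so its probability 0.7630 is multiplied by $c^{2} = 10^{8}$, which yields the integer weight 76300000.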
Transforming Reweighted Grammar into Admissible Specification 

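Given the reweighted grammar $\mathcal{G}^{*}_{sto}$, we immediately obtain the following admissible specification of the corresponding combinatorial classes (note that this specification has already been simplified by removing classes that are only duplicates of others):

$\mathcal{E}_1 = \mathcal{S} \times \mathcal{C}$, $\mathcal{E}_2 = \mathcal{A} \times \mathcal{C}$, $\mathcal{E}_3 = \mathcal{S} \times \alpha_{|}$, $\mathcal{E}_4 = \mathcal{A} \times \alpha_{|}$,

$\mathcal{S}_1 = \mathcal{E} \times \mathcal{A}$, $\mathcal{S}_2 = \mathcal{C} \times \mathcal{A}$, $\mathcal{S}_3 = \alpha_{|} \times \mathcal{A}$, $\mathcal{S}_4 = \mathcal{S} \times \mathcal{A}$, $\mathcal{S}_5 = \mathcal{A} \times \mathcal{A}$,

$\mathcal{C}_1 = \mathcal{C} \times \alpha_{|}$, $\mathcal{C}_2 = \alpha_{|} \times \alpha_{|}$,

$\mathcal{A}_1 = \alpha_{(} \times \mathcal{A} \times \alpha_{)}$, $\mathcal{A}_2 = \alpha_{(} \times \mathcal{M} \times \alpha_{)}$, $\mathcal{A}_3 = \alpha_{(} \times \mathcal{P} \times \alpha_{)}$,
$\mathcal{A}_4 = \alpha_{(} \times \mathcal{Q} \times \alpha_{)}$, $\mathcal{A}_5 = \alpha_{(} \times \mathcal{R} \times \alpha_{)}$, $\mathcal{A}_6 = \alpha_{(} \times \mathcal{F} \times \alpha_{)}$,
$\mathcal{A}_7 = \alpha_{(} \times \mathcal{G} \times \alpha_{)}$,

$\mathcal{G}_1 = \mathcal{A} \times \alpha_{|}$, $\mathcal{G}_2 = \mathcal{A} \times \mathcal{D}$, $\mathcal{G}_3 = \alpha_{|} \times \mathcal{A}$, $\mathcal{G}_4 = \mathcal{D} \times \mathcal{A}$,

$\mathcal{D}_1 = \mathcal{B} \times \alpha_{|}$, $\mathcal{D}_2 = \alpha_{|} \times \alpha_{|}$,

$\mathcal{B}_1 = \mathcal{B} \times \alpha_{|}$, $\mathcal{B}_2 = \alpha_{|} \times \alpha_{|}$,

$\mathcal{F}_1 = \alpha_{|} \times \alpha_{|} \times \alpha_{|}$, $\mathcal{F}_2 = \alpha_{|} \times \alpha_{|} \times \alpha_{|} \times \alpha_{|}$, $\mathcal{F}_3 = \alpha_{|} \times \alpha_{|} \times \alpha_{|} \times \alpha_{|} \times \mathcal{H}$,
$\mathcal{F}_4 = \alpha_{|} \times \alpha_{|} \times \alpha_{|} \times \alpha_{|} \times \alpha_{|}$,

$\mathcal{H}_1 = \mathcal{H} \times \alpha_{|}$, $\mathcal{H}_2 = \alpha_{|} \times \alpha_{|}$,

$\mathcal{P}_1 = \alpha_{|} \times \mathcal{A} \times \alpha_{|}$, $\mathcal{P}_2 = \alpha_{|} \times \mathcal{A} \times \alpha_{|} \times \alpha_{|}$, $\mathcal{P}_3 = \alpha_{|} \times \alpha_{|} \times \mathcal{A} \times \alpha_{|}$,
$\mathcal{P}_4 = \alpha_{|} \times \alpha_{|} \times \mathcal{A} \times \alpha_{|} \times \alpha_{|}$,

$\mathcal{Q}_1 = \alpha_{|} \times \alpha_{|} \times \mathcal{O} \times \alpha_{|} \times \alpha_{|}$, $\mathcal{Q}_2 = \alpha_{|} \times \alpha_{|} \times \mathcal{V} \times \alpha_{|}$,

$\mathcal{R}_1 = \alpha_{|} \times \mathcal{O} \times \alpha_{|} \times \alpha_{|}$, $\mathcal{R}_2 = \alpha_{|} \times \alpha_{|} \times \mathcal{W} \times \alpha_{|}$,

$\mathcal{V}_1 = \mathcal{J} \times \mathcal{O}$, $\mathcal{V}_2 = \alpha_{|} \times \mathcal{O}$,

$\mathcal{W}_1 = \mathcal{J} \times \mathcal{A}$, $\mathcal{W}_2 = \alpha_{|} \times \mathcal{A}$,

$\mathcal{O}_1 = \mathcal{A} \times \mathcal{K}$, $\mathcal{O}_2 = \mathcal{A} \times \alpha_{|}$,

$\mathcal{J}_1 = \mathcal{J} \times \alpha_{|}$, $\mathcal{J}_2 = \alpha_{|} \times \alpha_{|}$,

$\mathcal{K}_1 = \mathcal{K} \times \alpha_{|}$, $\mathcal{K}_2 = \alpha_{|} \times \alpha_{|}$,

$\mathcal{M}_1 = \mathcal{X} \times \mathcal{Z}$, $\mathcal{M}_2 = \mathcal{X} \times \mathcal{X}$, $\mathcal{M}_3 = \mathcal{X} \times \mathcal{A}$,
$\mathcal{M}_4 = \mathcal{A} \times \mathcal{Z}$, $\mathcal{M}_5 = \mathcal{A} \times \mathcal{X}$, $\mathcal{M}_6 = \mathcal{A} \times \mathcal{A}$,

$\mathcal{X}_1 = \mathcal{U} \times \mathcal{A}$, $\mathcal{X}_2 = \alpha_{|} \times \mathcal{A}$,

$\mathcal{Z}_1 = \mathcal{X} \times \mathcal{Z}$, $\mathcal{Z}_2 = \mathcal{X} \times \mathcal{U}$, $\mathcal{Z}_3 = \mathcal{X} \times \alpha_{|}$, $\mathcal{Z}_4 = \mathcal{X} \times \mathcal{X}$, $\mathcal{Z}_5 = \mathcal{X} \times \mathcal{A}$,
$\mathcal{Z}_6 = \mathcal{A} \times \mathcal{Z}$, $\mathcal{Z}_7 = \mathcal{A} \times \mathcal{U}$, $\mathcal{Z}_8 = \mathcal{A} \times \alpha_{|}$, $\mathcal{Z}_9 = \mathcal{A} \times \mathcal{X}$, $\mathcal{Z}_{10} = \mathcal{A} \times \mathcal{A}$,

$\mathcal{U}_1 = \mathcal{U} \times \alpha_{|}$, $\mathcal{U}_2 = \alpha_{|} \times \alpha_{|}$,

$\mathcal{S}' = \mu_{1} \cdot \mathcal{E} + \mu_{2} \cdot \mathcal{S} + \mu_{3} \cdot \mathcal{A}$,
$\mathcal{E} = \mu_{4} \cdot \mathcal{E}_1 + \mu_{5} \cdot \mathcal{E}_2 + \mu_{6} \cdot \mathcal{E}_3 + \mu_{7} \cdot \mathcal{E}_4$,
$\mathcal{S} = \mu_{8} \cdot \mathcal{S}_1 + \mu_{9} \cdot \mathcal{S}_2 + \mu_{10} \cdot \mathcal{S}_3 + \mu_{11} \cdot \mathcal{S}_4 + \mu_{12} \cdot \mathcal{S}_5$,
$\mathcal{C} = \mu_{13} \cdot \mathcal{C}_1 + \mu_{14} \cdot \mathcal{C}_2$,
$\mathcal{A} = \mu_{15} \cdot \mathcal{A}_1 + \mu_{16} \cdot \mathcal{A}_2 + \mu_{17} \cdot \mathcal{A}_3 + \mu_{18} \cdot \mathcal{A}_4 + \mu_{19} \cdot \mathcal{A}_5 + \mu_{20} \cdot \mathcal{A}_6 + \mu_{21} \cdot \mathcal{A}_7$,
$\mathcal{G} = \mu_{22} \cdot \mathcal{G}_1 + \mu_{23} \cdot \mathcal{G}_2 + \mu_{24} \cdot \mathcal{G}_3 + \mu_{25} \cdot \mathcal{G}_4$,
$\mathcal{D} = \mu_{26} \cdot \mathcal{D}_1 + \mu_{27} \cdot \mathcal{D}_2$,
$\mathcal{B} = \mu_{28} \cdot \mathcal{B}_1 + \mu_{29} \cdot \mathcal{B}_2$,
$\mathcal{F} = \mu_{30} \cdot \mathcal{F}_1 + \mu_{31} \cdot \mathcal{F}_2 + \mu_{32} \cdot \mathcal{F}_3 + \mu_{33} \cdot \mathcal{F}_4$,
$\mathcal{H} = \mu_{34} \cdot \mathcal{H}_1 + \mu_{35} \cdot \mathcal{H}_2$,
$\mathcal{P} = \mu_{36} \cdot \mathcal{P}_1 + \mu_{37} \cdot \mathcal{P}_2 + \mu_{38} \cdot \mathcal{P}_3 + \mu_{39} \cdot \mathcal{P}_4$,
$\mathcal{Q} = \mu_{40} \cdot \mathcal{Q}_1 + \mu_{41} \cdot \mathcal{Q}_2$,
$\mathcal{R} = \mu_{42} \cdot \mathcal{R}_1 + \mu_{43} \cdot \mathcal{R}_2$,
$\mathcal{V} = \mu_{44} \cdot \mathcal{V}_1 + \mu_{45} \cdot \mathcal{V}_2$,
$\mathcal{W} = \mu_{46} \cdot \mathcal{W}_1 + \mu_{47} \cdot \mathcal{W}_2$,
$\mathcal{O} = \mu_{48} \cdot \mathcal{O}_1 + \mu_{49} \cdot \mathcal{O}_2$,
$\mathcal{J} = \mu_{50} \cdot \mathcal{J}_1 + \mu_{51} \cdot \mathcal{J}_2$,
$\mathcal{K} = \mu_{52} \cdot \mathcal{K}_1 + \mu_{53} \cdot \mathcal{K}_2$,
$\mathcal{M} = \mu_{54} \cdot \mathcal{M}_1 + \mu_{55} \cdot \mathcal{M}_2 + \mu_{56} \cdot \mathcal{M}_3 + \mu_{57} \cdot \mathcal{M}_4 + \mu_{58} \cdot \mathcal{M}_5 + \mu_{59} \cdot \mathcal{M}_6$,
$\mathcal{X} = \mu_{60} \cdot \mathcal{X}_1 + \mu_{61} \cdot \mathcal{X}_2$,
$\mathcal{Z} = \mu_{62} \cdot \mathcal{Z}_1 + \mu_{63} \cdot \mathcal{Z}_2 + \mu_{64} \cdot \mathcal{Z}_3 + \mu_{65} \cdot \mathcal{Z}_4 + \mu_{66} \cdot \mathcal{Z}_5 + \mu_{67} \cdot \mathcal{Z}_6 + \mu_{68} \cdot \mathcal{Z}_7 + \mu_{69} \cdot \mathcal{Z}_8 + \mu_{70} \cdot \mathcal{Z}_9 + \mu_{71} \cdot \mathcal{Z}_{10}$,
$\mathcal{U} = \mu_{72} \cdot \mathcal{U}_1 + \mu_{73} \cdot \mathcal{U}_2$.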



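Table 5 Floating point approximations of the probabilities (weights) $\lambda_i$, $1 \le i \le 73$, for the production rules of the grammar $\mathcal{G}^{*}_{sto}$ (rounded to four decimal places)

Nonterminal Nt    Weights of Rules with Premise Nt
S'    $\lambda_{1} = 1.0000$, $\lambda_{2} = 0.0212$, $\lambda_{3} = 0.0003$,
E     $\lambda_{4} = 0.9788$, $\lambda_{5} = 0.0134$, $\lambda_{6} = 0.0944$, $\lambda_{7} = 0.0013$,
S     $\lambda_{8} = 0.8559$, $\lambda_{9} = 0.1304$, $\lambda_{10} = 0.0126$, $\lambda_{11} = 0.0181$, $\lambda_{12} = 0.0002$,
C     $\lambda_{13} = 0.9036$, $\lambda_{14} = 0.0871$,
A     $\lambda_{15} = 0.7630$, $\lambda_{16} = 0.0402$, $\lambda_{17} = 0.0186$, $\lambda_{18} = 0.0367$, $\lambda_{19} = 0.0072$, $\lambda_{20} = 0.0858$, $\lambda_{21} = 0.0484$,
G     $\lambda_{22} = 0.3038$, $\lambda_{23} = 0.1884$, $\lambda_{24} = 0.3081$, $\lambda_{25} = 0.1996$,
D     $\lambda_{26} = 1.0000$, $\lambda_{27} = 0.3896$,
B     $\lambda_{28} = 0.6104$, $\lambda_{29} = 0.2378$,
F     $\lambda_{30} = 0.0575$, $\lambda_{31} = 0.3409$, $\lambda_{32} = 0.6016$, $\lambda_{33} = 0.1211$,
H     $\lambda_{34} = 0.7987$, $\lambda_{35} = 0.1608$,
P     $\lambda_{36} = 0.1085$, $\lambda_{37} = 0.2144$, $\lambda_{38} = 0.2011$, $\lambda_{39} = 0.4760$,
Q     $\lambda_{40} = 0.1713$, $\lambda_{41} = 0.8287$,
R     $\lambda_{42} = 0.4150$, $\lambda_{43} = 0.5850$,
V     $\lambda_{44} = 1.0000$, $\lambda_{45} = 0.3243$,
W     $\lambda_{46} = 1.0000$, $\lambda_{47} = 0.3243$,
O     $\lambda_{48} = 1.0000$, $\lambda_{49} = 0.2928$,
J     $\lambda_{50} = 0.6757$, $\lambda_{51} = 0.2191$,
K     $\lambda_{52} = 0.7072$, $\lambda_{53} = 0.2071$,
M     $\lambda_{54} = 1.0000$, $\lambda_{55} = 0.0510$, $\lambda_{56} = 0.0036$, $\lambda_{57} = 0.0712$, $\lambda_{58} = 0.0036$, $\lambda_{59} = 0.0003$,
X     $\lambda_{60} = 0.9288$, $\lambda_{61} = 0.1968$,
Z     $\lambda_{62} = 0.4211$, $\lambda_{63} = 0.5279$, $\lambda_{64} = 0.1119$, $\lambda_{65} = 0.0215$, $\lambda_{66} = 0.0015$, $\lambda_{67} = 0.0300$, $\lambda_{68} = 0.0376$, $\lambda_{69} = 0.0080$, $\lambda_{70} = 0.0015$, $\lambda_{71} = 0.0001$,
U     $\lambda_{72} = 0.7881$, $\lambda_{73} = 0.1670$.

Note that for $i \in \{74, \ldots, 105\}$, $\lambda_i := 1$ holds.

Table 6 Integer weights $\mu_i$, $1 \le i \le 73$, for the production rules of the grammar $\mathcal{G}^{*}_{sto}$

Nonterminal Nt    Integer weights of Rules with Premise Nt
S'    $\mu_{1} = 10000$, $\mu_{2} = 212$, $\mu_{3} = 3$,
E     $\mu_{4} = 9788$, $\mu_{5} = 134$, $\mu_{6} = 944$, $\mu_{7} = 13$,
S     $\mu_{8} = 8559$, $\mu_{9} = 1304$, $\mu_{10} = 126$, $\mu_{11} = 181$, $\mu_{12} = 2$,
C     $\mu_{13} = 9036$, $\mu_{14} = 871$,
A     $\mu_{15} = 76300000$, $\mu_{16} = 4020000$, $\mu_{17} = 1860000$, $\mu_{18} = 3670000$, $\mu_{19} = 720000$, $\mu_{20} = 8580000$, $\mu_{21} = 4840000$,
G     $\mu_{22} = 3038$, $\mu_{23} = 1884$, $\mu_{24} = 3081$, $\mu_{25} = 1996$,
D     $\mu_{26} = 10000$, $\mu_{27} = 3896$,
B     $\mu_{28} = 6104$, $\mu_{29} = 2378$,
F     $\mu_{30} = 5750000$, $\mu_{31} = 340900000000$, $\mu_{32} = 6016000000000000$, $\mu_{33} = 1211000000000000$,
H     $\mu_{34} = 7987$, $\mu_{35} = 1608$,
P     $\mu_{36} = 10850000$, $\mu_{37} = 214400000000$, $\mu_{38} = 201100000000$, $\mu_{39} = 4760000000000000$,
Q     $\mu_{40} = 1713000000000000$, $\mu_{41} = 828700000000$,
R     $\mu_{42} = 415000000000$, $\mu_{43} = 585000000000$,
V     $\mu_{44} = 10000$, $\mu_{45} = 3243$,
W     $\mu_{46} = 10000$, $\mu_{47} = 3243$,
O     $\mu_{48} = 10000$, $\mu_{49} = 2928$,
J     $\mu_{50} = 6757$, $\mu_{51} = 2191$,
K     $\mu_{52} = 7072$, $\mu_{53} = 2071$,
M     $\mu_{54} = 10000$, $\mu_{55} = 510$, $\mu_{56} = 36$, $\mu_{57} = 712$, $\mu_{58} = 36$, $\mu_{59} = 3$,
X     $\mu_{60} = 9288$, $\mu_{61} = 1968$,
Z     $\mu_{62} = 4211$, $\mu_{63} = 5279$, $\mu_{64} = 1119$, $\mu_{65} = 215$, $\mu_{66} = 15$, $\mu_{67} = 300$, $\mu_{68} = 376$, $\mu_{69} = 80$, $\mu_{70} = 15$, $\mu_{71} = 1$,
U     $\mu_{72} = 7881$, $\mu_{73} = 1670$.

Note that for $i \in \{74, \ldots, 105\}$, $\mu_i := 1$ holds.

Now, this (simplified) specification can easily be transformed into the following recursive form for the function size:

\[
\mathrm{size}(X, n) := \begin{cases}
\mu_{1} \cdot \mathrm{size}(\mathcal{E}, n) + \mu_{2} \cdot \mathrm{size}(\mathcal{S}, n) + \mu_{3} \cdot \mathrm{size}(\mathcal{A}, n) & X = \mathcal{S}', \\
\mathrm{size}_{\mathcal{E}}(X, n) & X \in \{\mathcal{E}_i \mid 1 \le i \le 4\} \text{ or } X = \mathcal{E}, \\
\mathrm{size}_{\mathcal{S}}(X, n) & X \in \{\mathcal{S}_i \mid 1 \le i \le 5\} \text{ or } X = \mathcal{S}, \\
\mathrm{size}_{\mathcal{C}}(X, n) & X \in \{\mathcal{C}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{C}, \\
\mathrm{size}_{\mathcal{A}}(X, n) & X \in \{\mathcal{A}_i \mid 1 \le i \le 7\} \text{ or } X = \mathcal{A}, \\
\mathrm{size}_{\mathcal{G}}(X, n) & X \in \{\mathcal{G}_i \mid 1 \le i \le 4\} \text{ or } X = \mathcal{G}, \\
\mathrm{size}_{\mathcal{D}}(X, n) & X \in \{\mathcal{D}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{D}, \\
\mathrm{size}_{\mathcal{B}}(X, n) & X \in \{\mathcal{B}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{B}, \\
\mathrm{size}_{\mathcal{F}}(X, n) & X \in \{\mathcal{F}_i \mid 1 \le i \le 4\} \text{ or } X = \mathcal{F}, \\
\mathrm{size}_{\mathcal{H}}(X, n) & X \in \{\mathcal{H}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{H}, \\
\mathrm{size}_{\mathcal{P}}(X, n) & X \in \{\mathcal{P}_i \mid 1 \le i \le 4\} \text{ or } X = \mathcal{P}, \\
\mathrm{size}_{\mathcal{Q}}(X, n) & X \in \{\mathcal{Q}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{Q}, \\
\mathrm{size}_{\mathcal{R}}(X, n) & X \in \{\mathcal{R}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{R}, \\
\mathrm{size}_{\mathcal{V}}(X, n) & X \in \{\mathcal{V}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{V}, \\
\mathrm{size}_{\mathcal{W}}(X, n) & X \in \{\mathcal{W}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{W}, \\
\mathrm{size}_{\mathcal{O}}(X, n) & X \in \{\mathcal{O}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{O}, \\
\mathrm{size}_{\mathcal{J}}(X, n) & X \in \{\mathcal{J}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{J}, \\
\mathrm{size}_{\mathcal{K}}(X, n) & X \in \{\mathcal{K}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{K}, \\
\mathrm{size}_{\mathcal{M}}(X, n) & X \in \{\mathcal{M}_i \mid 1 \le i \le 6\} \text{ or } X = \mathcal{M}, \\
\mathrm{size}_{\mathcal{X}}(X, n) & X \in \{\mathcal{X}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{X}, \\
\mathrm{size}_{\mathcal{Z}}(X, n) & X \in \{\mathcal{Z}_i \mid 1 \le i \le 10\} \text{ or } X = \mathcal{Z}, \\
\mathrm{size}_{\mathcal{U}}(X, n) & X \in \{\mathcal{U}_i \mid 1 \le i \le 2\} \text{ or } X = \mathcal{U}, \\
0 & \text{else},
\end{cases}
\]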



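where

\[
\mathrm{size}_{\mathcal{E}}(X, n) := \begin{cases}
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{S}, j) \cdot \mathrm{size}(\mathcal{C}, n-j) & X = \mathcal{E}_1, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{C}, n-j) & X = \mathcal{E}_2, \\
\mathrm{size}(\mathcal{S}, n-1) & X = \mathcal{E}_3, \\
\mathrm{size}(\mathcal{A}, n-1) & X = \mathcal{E}_4, \\
\mu_{4} \cdot \mathrm{size}(\mathcal{E}_1, n) + \mu_{5} \cdot \mathrm{size}(\mathcal{E}_2, n) + \mu_{6} \cdot \mathrm{size}(\mathcal{E}_3, n) + \mu_{7} \cdot \mathrm{size}(\mathcal{E}_4, n) & X = \mathcal{E}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{S}}(X, n) := \begin{cases}
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{E}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{S}_1, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{C}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{S}_2, \\
\mathrm{size}(\mathcal{A}, n-1) & X = \mathcal{S}_3, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{S}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{S}_4, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{S}_5, \\
\mu_{8} \cdot \mathrm{size}(\mathcal{S}_1, n) + \mu_{9} \cdot \mathrm{size}(\mathcal{S}_2, n) + \mu_{10} \cdot \mathrm{size}(\mathcal{S}_3, n) + \mu_{11} \cdot \mathrm{size}(\mathcal{S}_4, n) + \mu_{12} \cdot \mathrm{size}(\mathcal{S}_5, n) & X = \mathcal{S}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{C}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{C}, n-1) & X = \mathcal{C}_1, \\
1 & X = \mathcal{C}_2 \text{ and } n = 2, \\
\mu_{13} \cdot \mathrm{size}(\mathcal{C}_1, n) + \mu_{14} \cdot \mathrm{size}(\mathcal{C}_2, n) & X = \mathcal{C}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{A}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{A}, n-2) & X = \mathcal{A}_1, \\
\mathrm{size}(\mathcal{M}, n-2) & X = \mathcal{A}_2, \\
\mathrm{size}(\mathcal{P}, n-2) & X = \mathcal{A}_3, \\
\mathrm{size}(\mathcal{Q}, n-2) & X = \mathcal{A}_4, \\
\mathrm{size}(\mathcal{R}, n-2) & X = \mathcal{A}_5, \\
\mathrm{size}(\mathcal{F}, n-2) & X = \mathcal{A}_6, \\
\mathrm{size}(\mathcal{G}, n-2) & X = \mathcal{A}_7, \\
\mu_{15} \cdot \mathrm{size}(\mathcal{A}_1, n) + \mu_{16} \cdot \mathrm{size}(\mathcal{A}_2, n) + \mu_{17} \cdot \mathrm{size}(\mathcal{A}_3, n) + \mu_{18} \cdot \mathrm{size}(\mathcal{A}_4, n) & \\
\quad + \mu_{19} \cdot \mathrm{size}(\mathcal{A}_5, n) + \mu_{20} \cdot \mathrm{size}(\mathcal{A}_6, n) + \mu_{21} \cdot \mathrm{size}(\mathcal{A}_7, n) & X = \mathcal{A}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{G}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{A}, n-1) & X = \mathcal{G}_1, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{D}, n-j) & X = \mathcal{G}_2, \\
\mathrm{size}(\mathcal{A}, n-1) & X = \mathcal{G}_3, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{D}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{G}_4, \\
\mu_{22} \cdot \mathrm{size}(\mathcal{G}_1, n) + \mu_{23} \cdot \mathrm{size}(\mathcal{G}_2, n) + \mu_{24} \cdot \mathrm{size}(\mathcal{G}_3, n) + \mu_{25} \cdot \mathrm{size}(\mathcal{G}_4, n) & X = \mathcal{G}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{D}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{B}, n-1) & X = \mathcal{D}_1, \\
1 & X = \mathcal{D}_2 \text{ and } n = 2, \\
\mu_{26} \cdot \mathrm{size}(\mathcal{D}_1, n) + \mu_{27} \cdot \mathrm{size}(\mathcal{D}_2, n) & X = \mathcal{D}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{B}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{B}, n-1) & X = \mathcal{B}_1, \\
1 & X = \mathcal{B}_2 \text{ and } n = 2, \\
\mu_{28} \cdot \mathrm{size}(\mathcal{B}_1, n) + \mu_{29} \cdot \mathrm{size}(\mathcal{B}_2, n) & X = \mathcal{B}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{F}}(X, n) := \begin{cases}
1 & X = \mathcal{F}_1 \text{ and } n = 3, \\
1 & X = \mathcal{F}_2 \text{ and } n = 4, \\
\mathrm{size}(\mathcal{H}, n-4) & X = \mathcal{F}_3, \\
1 & X = \mathcal{F}_4 \text{ and } n = 5, \\
\mu_{30} \cdot \mathrm{size}(\mathcal{F}_1, n) + \mu_{31} \cdot \mathrm{size}(\mathcal{F}_2, n) + \mu_{32} \cdot \mathrm{size}(\mathcal{F}_3, n) + \mu_{33} \cdot \mathrm{size}(\mathcal{F}_4, n) & X = \mathcal{F}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{H}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{H}, n-1) & X = \mathcal{H}_1, \\
1 & X = \mathcal{H}_2 \text{ and } n = 2, \\
\mu_{34} \cdot \mathrm{size}(\mathcal{H}_1, n) + \mu_{35} \cdot \mathrm{size}(\mathcal{H}_2, n) & X = \mathcal{H}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{P}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{A}, n-2) & X = \mathcal{P}_1, \\
\mathrm{size}(\mathcal{A}, n-3) & X = \mathcal{P}_2, \\
\mathrm{size}(\mathcal{A}, n-3) & X = \mathcal{P}_3, \\
\mathrm{size}(\mathcal{A}, n-4) & X = \mathcal{P}_4, \\
\mu_{36} \cdot \mathrm{size}(\mathcal{P}_1, n) + \mu_{37} \cdot \mathrm{size}(\mathcal{P}_2, n) + \mu_{38} \cdot \mathrm{size}(\mathcal{P}_3, n) + \mu_{39} \cdot \mathrm{size}(\mathcal{P}_4, n) & X = \mathcal{P}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{Q}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{O}, n-4) & X = \mathcal{Q}_1, \\
\mathrm{size}(\mathcal{V}, n-3) & X = \mathcal{Q}_2, \\
\mu_{40} \cdot \mathrm{size}(\mathcal{Q}_1, n) + \mu_{41} \cdot \mathrm{size}(\mathcal{Q}_2, n) & X = \mathcal{Q}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{R}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{O}, n-3) & X = \mathcal{R}_1, \\
\mathrm{size}(\mathcal{W}, n-3) & X = \mathcal{R}_2, \\
\mu_{42} \cdot \mathrm{size}(\mathcal{R}_1, n) + \mu_{43} \cdot \mathrm{size}(\mathcal{R}_2, n) & X = \mathcal{R}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{V}}(X, n) := \begin{cases}
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{J}, j) \cdot \mathrm{size}(\mathcal{O}, n-j) & X = \mathcal{V}_1, \\
\mathrm{size}(\mathcal{O}, n-1) & X = \mathcal{V}_2, \\
\mu_{44} \cdot \mathrm{size}(\mathcal{V}_1, n) + \mu_{45} \cdot \mathrm{size}(\mathcal{V}_2, n) & X = \mathcal{V}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{W}}(X, n) := \begin{cases}
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{J}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{W}_1, \\
\mathrm{size}(\mathcal{A}, n-1) & X = \mathcal{W}_2, \\
\mu_{46} \cdot \mathrm{size}(\mathcal{W}_1, n) + \mu_{47} \cdot \mathrm{size}(\mathcal{W}_2, n) & X = \mathcal{W}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{O}}(X, n) := \begin{cases}
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{K}, n-j) & X = \mathcal{O}_1, \\
\mathrm{size}(\mathcal{A}, n-1) & X = \mathcal{O}_2, \\
\mu_{48} \cdot \mathrm{size}(\mathcal{O}_1, n) + \mu_{49} \cdot \mathrm{size}(\mathcal{O}_2, n) & X = \mathcal{O}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{J}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{J}, n-1) & X = \mathcal{J}_1, \\
1 & X = \mathcal{J}_2 \text{ and } n = 2, \\
\mu_{50} \cdot \mathrm{size}(\mathcal{J}_1, n) + \mu_{51} \cdot \mathrm{size}(\mathcal{J}_2, n) & X = \mathcal{J}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{K}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{K}, n-1) & X = \mathcal{K}_1, \\
1 & X = \mathcal{K}_2 \text{ and } n = 2, \\
\mu_{52} \cdot \mathrm{size}(\mathcal{K}_1, n) + \mu_{53} \cdot \mathrm{size}(\mathcal{K}_2, n) & X = \mathcal{K}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{M}}(X, n) := \begin{cases}
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{X}, j) \cdot \mathrm{size}(\mathcal{Z}, n-j) & X = \mathcal{M}_1, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{X}, j) \cdot \mathrm{size}(\mathcal{X}, n-j) & X = \mathcal{M}_2, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{X}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{M}_3, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{Z}, n-j) & X = \mathcal{M}_4, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{X}, n-j) & X = \mathcal{M}_5, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{M}_6, \\
\mu_{54} \cdot \mathrm{size}(\mathcal{M}_1, n) + \mu_{55} \cdot \mathrm{size}(\mathcal{M}_2, n) + \mu_{56} \cdot \mathrm{size}(\mathcal{M}_3, n) & \\
\quad + \mu_{57} \cdot \mathrm{size}(\mathcal{M}_4, n) + \mu_{58} \cdot \mathrm{size}(\mathcal{M}_5, n) + \mu_{59} \cdot \mathrm{size}(\mathcal{M}_6, n) & X = \mathcal{M}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{X}}(X, n) := \begin{cases}
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{U}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{X}_1, \\
\mathrm{size}(\mathcal{A}, n-1) & X = \mathcal{X}_2, \\
\mu_{60} \cdot \mathrm{size}(\mathcal{X}_1, n) + \mu_{61} \cdot \mathrm{size}(\mathcal{X}_2, n) & X = \mathcal{X}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{Z}}(X, n) := \begin{cases}
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{X}, j) \cdot \mathrm{size}(\mathcal{Z}, n-j) & X = \mathcal{Z}_1, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{X}, j) \cdot \mathrm{size}(\mathcal{U}, n-j) & X = \mathcal{Z}_2, \\
\mathrm{size}(\mathcal{X}, n-1) & X = \mathcal{Z}_3, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{X}, j) \cdot \mathrm{size}(\mathcal{X}, n-j) & X = \mathcal{Z}_4, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{X}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{Z}_5, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{Z}, n-j) & X = \mathcal{Z}_6, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{U}, n-j) & X = \mathcal{Z}_7, \\
\mathrm{size}(\mathcal{A}, n-1) & X = \mathcal{Z}_8, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{X}, n-j) & X = \mathcal{Z}_9, \\
\sum_{j=1}^{n-1} \mathrm{size}(\mathcal{A}, j) \cdot \mathrm{size}(\mathcal{A}, n-j) & X = \mathcal{Z}_{10}, \\
\mu_{62} \cdot \mathrm{size}(\mathcal{Z}_1, n) + \mu_{63} \cdot \mathrm{size}(\mathcal{Z}_2, n) + \mu_{64} \cdot \mathrm{size}(\mathcal{Z}_3, n) + \mu_{65} \cdot \mathrm{size}(\mathcal{Z}_4, n) & \\
\quad + \mu_{66} \cdot \mathrm{size}(\mathcal{Z}_5, n) + \mu_{67} \cdot \mathrm{size}(\mathcal{Z}_6, n) + \mu_{68} \cdot \mathrm{size}(\mathcal{Z}_7, n) + \mu_{69} \cdot \mathrm{size}(\mathcal{Z}_8, n) & \\
\quad + \mu_{70} \cdot \mathrm{size}(\mathcal{Z}_9, n) + \mu_{71} \cdot \mathrm{size}(\mathcal{Z}_{10}, n) & X = \mathcal{Z}, \\
0 & \text{else},
\end{cases}
\]

\[
\mathrm{size}_{\mathcal{U}}(X, n) := \begin{cases}
\mathrm{size}(\mathcal{U}, n-1) & X = \mathcal{U}_1, \\
1 & X = \mathcal{U}_2 \text{ and } n = 2, \\
\mu_{72} \cdot \mathrm{size}(\mathcal{U}_1, n) + \mu_{73} \cdot \mathrm{size}(\mathcal{U}_2, n) & X = \mathcal{U}, \\
0 & \text{else}.
\end{cases}
\]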
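To illustrate how such recurrences are evaluated with integer arithmetic only, the following Python sketch tabulates the weighted class sizes for the two classes $\mathcal{H}$ and $\mathcal{F}$, which form a small self-contained part of the system. It is a minimal sketch under our own assumptions, not the preprocessing routine of the paper: the names `MU` and `size_tables` are hypothetical, the integer weights are those obtained by reweighting $\lambda_{30}, \ldots, \lambda_{35}$, and classes that are products of two non-atomic classes would additionally require the convolution sums $\sum_{j=1}^{n-1}$, which are not shown here.

```python
# Integer weights mu_30 .. mu_35 (reweighted lambda_30 .. lambda_35).
MU = {30: 5_750_000, 31: 340_900_000_000, 32: 6_016_000_000_000_000,
      33: 1_211_000_000_000_000, 34: 7_987, 35: 1_608}


def size_tables(n_max):
    """Return lists size_H[0..n_max] and size_F[0..n_max] of weighted class sizes."""
    size_H = [0] * (n_max + 1)
    size_F = [0] * (n_max + 1)
    for n in range(2, n_max + 1):
        # H = mu_34 * (H x atom) + mu_35 * (atom x atom)
        size_H[n] = MU[34] * size_H[n - 1] + (MU[35] if n == 2 else 0)
    for n in range(3, n_max + 1):
        total = 0
        if n == 3:
            total += MU[30]                   # F_1: three atoms
        if n == 4:
            total += MU[31]                   # F_2: four atoms
        if n >= 4:
            total += MU[32] * size_H[n - 4]   # F_3: four atoms followed by H
        if n == 5:
            total += MU[33]                   # F_4: five atoms
        size_F[n] = total
    return size_H, size_F


size_H, size_F = size_tables(8)
```

Python integers have arbitrary precision, so the rapidly growing weighted sizes are represented exactly. A full preprocessing step would fill one such table per class for all sizes up to n.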



From these recurrences, the desired algorithm can easily be constructed. As a complete presentation of this algorithm would be too lengthy, we omit it and instead refer to Algorithms 1 to 4 and 6 given in [20], since exactly these algorithms served as subroutines in the construction of our unranking algorithm.
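For orientation only, the basic step that any unranking procedure for a weighted sum of classes has to perform can be sketched as follows. This is not one of the Algorithms from [20]; it is a generic illustration under our own assumptions, with the hypothetical name `choose_component` and made-up example numbers. Given the weighted sizes $\mu_i \cdot \mathrm{size}(\mathcal{A}_i, n)$ of the components of a class and a rank within the whole class, it determines the component the rank falls into, using integer comparisons and subtractions only.

```python
def choose_component(weighted_sizes, rank):
    """Map a rank in [0, sum(weighted_sizes)) to (component index, remaining rank)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    for index, weight in enumerate(weighted_sizes):
        if rank < weight:
            # The rank lies inside the block of this component.
            return index, rank
        rank -= weight
    raise ValueError("rank exceeds the total weighted size")


# Hypothetical weighted sizes of three components; the second one is empty.
example_sizes = [30, 0, 12]
first = choose_component(example_sizes, 29)
second = choose_component(example_sizes, 30)
last = choose_component(example_sizes, 41)
```

Drawing the rank uniformly at random from the whole range selects each component with probability proportional to its weighted size, which is the idea behind non-uniform generation by means of unranking.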
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